System and method for language model personalization

ABSTRACT

A method, an electronic device, and a computer readable medium are provided. The method includes identifying a set of observable features associated with one or more users. The method also includes generating latent features from the set of observable features. The method additionally includes sorting the latent features into one or more clusters. Each of the one or more clusters represents verbal utterances of a group of users that share a portion of the latent features. The method further includes generating a language model that corresponds to a specific cluster of the one or more clusters. The language model represents a probability ranking of the verbal utterances that are associated with the group of users of the specific cluster.

CROSS-REFERENCE TO RELATED APPLICATION AND CLAIM OF PRIORITY

This application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application No. 62/639,114 filed on Mar. 6, 2018. The above-identified provisional patent application is hereby incorporated by reference in its entirety.

TECHNICAL FIELD

This disclosure relates generally to electronic devices. More specifically, this disclosure relates to generating personalized language models for automatic speech recognition.

BACKGROUND

Methods for interacting with and controlling a computing device are continually improving in order to conform to more natural approaches. Many such methods generally require a user to utilize a user interface instrument such as a keyboard or a mouse or, if the screen is a touch screen, to physically touch the screen itself to provide an input. Certain electronic devices employ voice-enabled user interfaces for enabling a user to interact with a computing device. Natural language usage is becoming the interaction method of choice with certain electronic devices and appliances. A smooth transition from natural language to the intended interaction can play an increasingly important role in consumer satisfaction.

SUMMARY

This disclosure provides a system and method for contextualizing automatic speech recognition.

In one embodiment, a method is provided. The method includes identifying a set of observable features associated with one or more users. The method also includes generating a set of latent features from the set of observable features. The method additionally includes sorting the latent features into one or more clusters, each of the one or more clusters representing verbal utterances of a group of users that share a portion of the latent features. The method further includes generating a language model that corresponds to a specific cluster of the one or more clusters. The language model represents a probability ranking of the verbal utterances that are associated with the group of users of the specific cluster.

In another embodiment, an electronic device is provided. The electronic device includes a processor. The processor is configured to identify a set of observable features associated with one or more users. The processor is also configured to generate a set of latent features from the set of observable features. The processor is additionally configured to sort the latent features into one or more clusters, each of the one or more clusters representing verbal utterances of a group of users that share a portion of the latent features. The processor is further configured to generate a language model that corresponds to a specific cluster of the one or more clusters. The language model represents a probability ranking of the verbal utterances that are associated with the group of users of the specific cluster.

In another embodiment, a non-transitory computer readable medium embodying a computer program is provided. The computer program includes computer readable program code that, when executed by a processor of an electronic device, causes the processor to identify a set of observable features associated with one or more users; generate a set of latent features from the set of observable features; sort the latent features into one or more clusters, each of the one or more clusters representing verbal utterances of a group of users that share a portion of the latent features; and generate a language model that corresponds to a specific cluster of the one or more clusters, the language model representing a probability ranking of the verbal utterances that are associated with the group of users of the specific cluster.

Other technical features may be readily apparent to one skilled in the art from the following figures, descriptions, and claims.

Before undertaking the DETAILED DESCRIPTION below, it may be advantageous to set forth definitions of certain words and phrases used throughout this patent document. The term “couple” and its derivatives refer to any direct or indirect communication between two or more elements, whether or not those elements are in physical contact with one another. The terms “transmit,” “receive,” and “communicate,” as well as derivatives thereof, encompass both direct and indirect communication. The terms “include” and “comprise,” as well as derivatives thereof, mean inclusion without limitation. The term “or” is inclusive, meaning and/or. The phrase “associated with,” as well as derivatives thereof, means to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, have a relationship to or with, or the like. The term “controller” means any device, system or part thereof that controls at least one operation. Such a controller may be implemented in hardware or a combination of hardware and software and/or firmware. The functionality associated with any particular controller may be centralized or distributed, whether locally or remotely. The phrase “at least one of,” when used with a list of items, means that different combinations of one or more of the listed items may be used, and only one item in the list may be needed. For example, “at least one of: A, B, and C” includes any of the following combinations: A, B, C, A and B, A and C, B and C, and A and B and C.

Moreover, various functions described below can be implemented or supported by one or more computer programs, each of which is formed from computer readable program code and embodied in a computer readable medium. The terms “application” and “program” refer to one or more computer programs, software components, sets of instructions, procedures, functions, objects, classes, instances, related data, or a portion thereof adapted for implementation in a suitable computer readable program code. The phrase “computer readable program code” includes any type of computer code, including source code, object code, and executable code. The phrase “computer readable medium” includes any type of medium capable of being accessed by a computer, such as read only memory (ROM), random access memory (RAM), a hard disk drive, a compact disc (CD), a digital video disc (DVD), or any other type of memory. A “non-transitory” computer readable medium excludes wired, wireless, optical, or other communication links that transport transitory electrical or other signals. A non-transitory computer readable medium includes media where data can be permanently stored and media where data can be stored and later overwritten, such as a rewritable optical disc or an erasable memory device.

Definitions for other certain words and phrases are provided throughout this patent document. Those of ordinary skill in the art should understand that in many if not most instances, such definitions apply to prior as well as future uses of such defined words and phrases.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present disclosure and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which like reference numerals represent like parts:

FIG. 1 illustrates an example communication system in accordance with embodiments of the present disclosure;

FIG. 2 illustrates an example electronic device in accordance with an embodiment of this disclosure;

FIG. 3 illustrates an example electronic device in accordance with an embodiment of this disclosure;

FIGS. 4A and 4B illustrate an automatic speech recognition system in accordance with an embodiment of this disclosure;

FIG. 4C illustrates a block diagram of an example environment architecture, in accordance with an embodiment of this disclosure;

FIGS. 5A, 5B, and 5C illustrate an example auto-encoder in accordance with an embodiment of this disclosure;

FIG. 6A illustrates an example process for creating multiple personalized language models in accordance with an embodiment of this disclosure;

FIG. 6B illustrates an example cluster in accordance with an embodiment of this disclosure;

FIG. 7 illustrates an example process for creating a personalized language model for a new user in accordance with an embodiment of this disclosure; and

FIG. 8 illustrates an example method for determining an operation to perform based on contextual information, in accordance with an embodiment of this disclosure.

DETAILED DESCRIPTION

FIGS. 1 through 8, discussed below, and the various embodiments used to describe the principles of the present disclosure in this patent document are by way of illustration only and should not be construed in any way to limit the scope of the disclosure. Those skilled in the art will understand that the principles of the present disclosure may be implemented in any suitably-arranged system or device.

According to embodiments of the present disclosure, various methods for controlling and interacting with a computing device are provided. Graphical user interfaces allow a user to interact with an electronic device, such as a computing device, by enabling a user to locate and select objects on a screen. Common interactions include physical manipulations, such as a user physically moving a mouse, typing on a keyboard, or touching a touch screen of a touch sensitive surface, among others. There are instances when utilizing various physical interactions, such as touching a touchscreen, is not feasible, such as when a user wears a head mounted display, or if the device does not include a display, and the like. Additionally, there are instances when utilizing various physical interactions, such as touching a touchscreen or using an accessory (such as a keyboard, mouse, touch pad, remote, or the like), is inconvenient or cumbersome. Embodiments of the present disclosure also allow for additional approaches to interact with an electronic device. It is noted that as used herein, the term “user” may denote a human or another device (such as an artificial intelligent electronic device) using the electronic device.

An electronic device, according to embodiments of the present disclosure, can include personal computers (such as a laptop or a desktop), a workstation, a server, a television, an appliance, and the like. Additionally, the electronic device can be at least one of a part of a piece of furniture or building/structure, an electronic board, an electronic signature receiving device, a projector, or a measurement device. In certain embodiments, the electronic device can be a portable electronic device such as a portable communication device (such as a smartphone or mobile phone), a laptop, a tablet, an electronic book reader (such as an e-reader), a personal digital assistant (PDA), a portable multimedia player (PMP), an MP3 player, a mobile medical device, a virtual reality headset, a portable game console, a camera, and a wearable device, among others. The electronic device is one or a combination of the above-listed devices. Additionally, the electronic device as disclosed herein is not limited to the above-listed devices, and can include new electronic devices depending on the development of technology.

According to embodiments of the present disclosure, a natural approach to interacting with and controlling a computing device is a voice-enabled user interface. Voice-enabled user interfaces enable a user to interact with a computing device through the act of speaking. Speaking can include a human speaking directly to the electronic device or another electronic device projecting sound through a speaker. Once the computing device detects and receives the sound, the computing device can derive contextual meaning from the oral command and thereafter perform the requested task.

Certain automatic speech recognition (ASR) systems enable the recognition and translation of spoken language into text on a computing device, such as speech to text. Additionally, ASR systems can also include a user interface that performs one or more functions or actions based on the specific instructions received from the user. For example, if a user recites “call spouse” to a telephone, the phone can interpret the meaning of the user by looking up a phone number associated with ‘spouse,’ and dial the phone number associated with the ‘spouse’ of the user. Similarly, if a user speaks “call spouse” to a smart phone, the smart phone can identify the task as a request to use the phone function and activate the phone feature of the device, look up a phone number associated with ‘spouse,’ and subsequently dial the phone number of the spouse of the user. In another example, a user can speak “what is the weather” to a particular device, and the device can look up the weather based on the location of the user, and either display the weather on a display or speak the weather to the user through a speaker. In another example, a user can recite “turn on the TV” to an electronic device, and a particular TV will turn on.

Embodiments of the present disclosure recognize and take into consideration that certain verbal utterances are more likely to be spoken than others. For example, certain verbal utterances are more likely to be spoken than others based on the context. Therefore, embodiments of the present disclosure provide systems and methods that associate context with particular language models to derive an improved ASR system. In certain embodiments, context can include (i) domain context, (ii) dialog flow context, (iii) user profile context, (iv) usage log context, (v) environment and location context, and (vi) device context. Domain context indicates the subject matter of the verbal utterance. For example, if the domain is music, a user is more likely to speak a song name, an album name, or an artist name. Dialog flow context is based on the context of the conversation itself. For example, if the user speaks “book a flight to New York,” the electronic device can respond by saying “when.” The response by the user to the electronic device specifying a particular date is in response to the question by the electronic device, and not an unrelated utterance. User profile context can associate vernacular and pronunciation with a particular user. For example, based on the age, gender, location, and other biographical information, a user is more likely to speak certain words than others. For instance, based on the location of the user, the verbal utterance of “y'all” is more common than the utterance of “you guys.” Similarly, based on the location of the user, the verbal utterance of “traffic circle” is more common than “roundabout,” even though both utterances refer to the same object. Usage logs indicate a number of frequently used commands. For example, based on usage logs, if a verbal utterance is common, the user is more likely to use the same command again. The environment and location of the user assist the electronic device in understanding accents or various pronunciations of similar words. The device context indicates the type of electronic device. For example, depending on whether the electronic device is a phone or an appliance, the verbal utterances of the user can vary. Moreover, the context is based on identified interests of the user and is used in creating a personalized language model that indicates a probability that certain verbal utterances are more likely to be spoken than others, based on the individual user.

Embodiments of the present disclosure also take into consideration that certain language models can include various models for different groups in the population. Such models do not discover interdependences between contextual features as well as latent features that are associated with a particular user. For example, a language model can be trained in order to learn how the English language (or any other language) behaves. A language model can also be domain specific, such as for a specific geographical or regional area or for specific persons. Therefore, embodiments of the present disclosure provide a contextual ASR system that uses data from various aspects, such as different contexts, to provide a rescoring of utterances for greater accuracy and understanding by the computing device.

Embodiments of the present disclosure provide systems and methods for contextualizing ASR systems by building personalized language models. A language model is a probability distribution over sequences of words. For example, a language model estimates the relative likelihood of different phrases for natural language processing that is associated with ASR systems. For example, in an ASR system, the electronic device attempts to match sounds with word sequences. A language model provides context to distinguish between words and phrases that sound similar. In certain embodiments, separate language models can be generated for each group in a population. Grouping can be based on observable features.
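
By way of illustration only, the notion of a language model as a probability distribution over word sequences can be sketched as follows. The sketch assumes a bigram model with add-one smoothing, which this disclosure does not specify. The training sentences, the candidate phrases, and the function names train_bigram_counts and log_prob are hypothetical.

```python
from collections import Counter
import math


def train_bigram_counts(sentences):
    """Count unigrams and bigrams, adding <s> and </s> boundary tokens."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed log-probability of a word sequence (assumed scheme)."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        numerator = bigrams[(prev, word)] + 1
        denominator = unigrams[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total


# Hypothetical utterances previously observed for a group of users.
corpus = [
    "recognize speech",
    "recognize speech please",
    "call spouse",
    "what is the weather",
]
unigrams, bigrams = train_bigram_counts(corpus)

# Two hypothetical candidate transcriptions that sound similar.
candidates = ["recognize speech", "wreck a nice beach"]
best = max(candidates, key=lambda c: log_prob(c, unigrams, bigrams))
```

In this sketch, the candidate whose word sequence was observed more often in the hypothetical corpus receives the higher log-probability, which is how a language model can help distinguish between phrases that sound similar.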

Additionally, embodiments of the present disclosure provide systems and methods for generating a language model that leverages latent features that are extracted from user profiles and usage patterns. User profiles and usage patterns are an example of observable features. Observable features can include both classic features and augmented features. In certain embodiments, observable features include both.
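
By way of illustration only, sorting latent features into clusters and deriving a probability ranking of utterances for one cluster can be sketched as follows. The sketch assumes k-means clustering with a deterministic initialization and a simple relative-frequency ranking, neither of which is mandated by this disclosure. The latent vectors, the utterances, and the function names kmeans and rank_utterances are hypothetical.

```python
from collections import Counter


def kmeans(points, k, iterations=10):
    """Plain k-means; centroids start at the first k points (assumed choice)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each latent vector to its nearest centroid.
        assignment = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Move each centroid to the mean of its assigned vectors.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def rank_utterances(utterances_by_user, assignment, cluster):
    """Rank the utterances of one cluster by relative frequency."""
    counts = Counter()
    for utterances, a in zip(utterances_by_user, assignment):
        if a == cluster:
            counts.update(utterances)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(u, n / total) for u, n in counts.most_common()]


# Hypothetical latent feature vectors, one per user.
latent = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 0.9)]
# Hypothetical utterances logged for the same users, in the same order.
utterances_by_user = [
    ["play jazz", "play the album"],
    ["call spouse", "call spouse"],
    ["play jazz"],
    ["call spouse", "what is the weather"],
]

assignment = kmeans(latent, k=2)
ranking_cluster_0 = rank_utterances(utterances_by_user, assignment, 0)
ranking_cluster_1 = rank_utterances(utterances_by_user, assignment, 1)
```

Each ranking in this sketch plays the role of a very small cluster-specific language model: users whose latent vectors fall in the same cluster share one ranking of likely utterances.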

According to embodiments of the present disclosure, personalized language models improve speech recognition, such as those associated with ASR systems. The personalized language models can also improve various predictive user inputs, such as a predictive keyboard and smart autocorrect functions. The personalized language models can also improve personalized machine translation systems as well as personalized handwriting recognition systems.

FIG. 1 illustrates an example computing system 100 according to this disclosure. The embodiment of the system 100 shown in FIG. 1 is for illustration only. Other embodiments of the system 100 can be used without departing from the scope of this disclosure.

The system 100 includes a network 102 that facilitates communication between various components in the system 100. For example, the network 102 can communicate Internet Protocol (IP) packets, frame relay frames, Asynchronous Transfer Mode (ATM) cells, or other information between network addresses. The network 102 includes one or more local area networks (LANs), metropolitan area networks (MANs), wide area networks (WANs), all or a portion of a global network such as the Internet, or any other communication system or systems at one or more locations.

The network 102 facilitates communications between a server 104 and various client devices 106-114. The client devices 106-114 may be, for example, a smartphone, a tablet computer, a laptop, a personal computer, a wearable device, a head-mounted display (HMD), or the like. The server 104 can represent one or more servers. Each server 104 includes any suitable computing or processing device that can provide computing services for one or more client devices, such as the client devices 106-114. Each server 104 could, for example, include one or more processing devices, one or more memories storing instructions and data, and one or more network interfaces facilitating communication over the network 102. In certain embodiments, the server 104 is an ASR system that can identify verbal utterances of a user. In certain embodiments, the server 104 generates language models and provides a language model to one of the client devices 106-114 that performs the ASR. Each of the generated language models can be adaptively used in any of the client devices 106-114. In certain embodiments, the server 104 can include a neural network such as an auto-encoder that derives latent features from a set of observable features that are associated with a particular user. Additionally, in certain embodiments, the server 104 can derive latent features from a set of observable features.

Each client device 106-114 represents any suitable computing or processing device that interacts with at least one server (such as server 104) or other computing device(s) over the network 102. In this example, the client devices 106-114 include a desktop computer 106, a mobile telephone or mobile device 108 (such as a smartphone), a personal digital assistant (PDA) 110, a laptop computer 112, and a tablet computer 114. However, any other or additional client devices could be used in the system 100. Smartphones represent a class of mobile devices 108 that are handheld devices with a mobile operating system and an integrated mobile broadband cellular network connection for voice, short message service (SMS), and internet data communication. As described in more detail below, an electronic device (such as the mobile device 108, PDA 110, laptop computer 112, and the tablet computer 114) can include a user interface engine that modifies one or more user interface buttons displayed to a user on a touchscreen.

In this example, some client devices 108-114 communicate indirectly with the network 102. For example, the client devices 108 and 110 (mobile devices 108 and PDA 110, respectively) communicate via one or more base stations 116, such as cellular base stations or eNodeBs (eNBs). Also, the client devices 112 and 114 (laptop computer 112 and tablet computer 114, respectively) communicate via one or more wireless access points 118, such as IEEE 802.11 wireless access points. Note that these are for illustration only and that each client device 106-114 could communicate directly with the network 102 or indirectly with the network 102 via any suitable intermediate device(s) or network(s).

In certain embodiments, the mobile device 108 (or any other client device 106-114) transmits information securely and efficiently to another device, such as, for example, the server 104. The mobile device 108 (or any other client device 106-114) can trigger the information transmission between itself and server 104.

Although FIG. 1 illustrates one example of a system 100, various changes can be made to FIG. 1. For example, the system 100 could include any number of each component in any suitable arrangement. In general, computing and communication systems come in a wide variety of configurations, and FIG. 1 does not limit the scope of this disclosure to any particular configuration. While FIG. 1 illustrates one operational environment in which various features disclosed in this patent document can be used, these features could be used in any other suitable system.

The processes and systems provided in this disclosure allow for a client device to receive a verbal utterance from a user and, through an ASR system, identify and understand the received verbal utterance from the user. In certain embodiments, the server 104 or any of the client devices 106-114 can generate a personalized language model for the ASR system of a client device 106-114 to identify and understand the received verbal utterance from the user.

FIGS. 2 and 3 illustrate example devices in a computing system in accordance with an embodiment of this disclosure. In particular, FIG. 2 illustrates an example server 200, and FIG. 3 illustrates an example electronic device 300. The server 200 could represent the server 104 in FIG. 1, and the electronic device 300 could represent one or more of the client devices 106-114 in FIG. 1.

The server 200 can represent one or more local servers, one or more remote servers, clustered computers and components that act as a single pool of seamless resources, a cloud based server, a neural network, and the like. The server 200 can be accessed by one or more of the client devices 106-114.

As shown in FIG. 2, the server 200 includes a bus system 205 that supports communication between at least one processing device 210, at least one storage device(s) 215, at least one communications interface 220, and at least one input/output (I/O) unit 225.

The processing device 210, such as a processor, executes instructions that can be stored in a memory 230. The processing device 210 can include any suitable number(s) and type(s) of processors or other devices in any suitable arrangement. Example types of the processing devices 210 include microprocessors, microcontrollers, digital signal processors, field programmable gate arrays, application specific integrated circuits, and discrete circuitry.

The memory 230 and a persistent storage 235 are examples of storage devices 215 that represent any structure(s) capable of storing and facilitating retrieval of information (such as data, program code, or other suitable information) on a temporary or permanent basis. The memory 230 can represent a random access memory or any other suitable volatile or non-volatile storage device(s). The persistent storage 235 can contain one or more components or devices supporting longer-term storage of data, such as a read only memory, hard drive, Flash memory, or optical disc.

The communications interface 220 supports communications with other systems or devices. For example, the communications interface 220 could include a network interface card or a wireless transceiver facilitating communications over the network 102. The communications interface 220 can support communications through any suitable physical or wireless communication link(s).

The I/O unit 225 allows for input and output of data. For example, the I/O unit 225 can provide a connection for user input through a keyboard, mouse, keypad, touchscreen, or other suitable input device. The I/O unit 225 can also send output to a display, printer, or other suitable output device.

Note that while FIG. 2 is described as representing the server 104 of FIG. 1, the same or similar structure could be used in one or more of the various client devices 106-114. For example, a desktop computer 106 or a laptop computer 112 could have the same or similar structure as that shown in FIG. 2.

In certain embodiments, the server 200 is an ASR system that includes a neural network such as an auto-encoder. In certain embodiments, the auto-encoder is included in an electronic device, such as the electronic device 300 of FIG. 3. The server 200 is able to derive latent features from observable features that are associated with users. In certain embodiments, the server 200 is also able to generate multiple language models based on derived latent features. The multiple language models are then used to generate a personalized language model for a particular user. In certain embodiments, the personalized language model is generated by the server 200 or a client device, such as the client devices 106-114 of FIG. 1. It should be noted that the multiple language models can also be generated on any of the client devices 106-114 of FIG. 1.

A neural network is a combination of hardware and software that is patterned after the operations of neurons in a human brain. A neural network can solve problems in, and extract information from, complex signal processing, pattern recognition, or pattern production. Pattern recognition includes the recognition of objects that are seen, heard, felt, and the like.

Neural networks can handle information differently than conventional computers. For example, a neural network has a parallel architecture. In another example, the way information is represented, processed, and stored by a neural network varies from that of a conventional computer. The inputs to a neural network are processed as patterns of signals that are distributed over discrete processing elements, rather than as binary numbers. Structurally, a neural network involves a large number of processors that operate in parallel and are arranged in tiers. For example, the first tier receives raw input information and each successive tier receives the output from the preceding tier. Each tier is highly interconnected, such that each node in tier n can be connected to multiple nodes in tier n−1 (which supply the node's inputs) and to multiple nodes in tier n+1 (for which the node provides input). Each processing node includes a set of rules that it was originally given or developed for itself over time.

For example, a neural network can recognize patterns in sequences of data. For instance, a neural network can recognize a pattern from observable features associated with one user or many users. The neural network can analyze the observable features and derive latent features from the observable features.

The architectures of a neural network provide that each neuron can modify the relationship between inputs and outputs by some rule. One type of neural network is a feed forward network, in which information is passed through nodes without touching the same node twice. Another type of neural network is a recurrent neural network. A recurrent neural network can include a feedback loop that allows a node to be provided with past decisions. A recurrent neural network can include multiple layers, in which each layer includes numerous cells called long short-term memory (“LSTM”) cells. An LSTM can include an input gate, an output gate, and a forget gate. A single LSTM can remember a value over a period of time and can assist in preserving an error that can be back propagated through the layers of the neural network.
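
By way of illustration only, the role of the input gate, the output gate, and the forget gate of an LSTM can be sketched as follows. The sketch assumes the commonly used LSTM cell equations applied to scalar values, which this disclosure does not spell out. The parameter names and values in params and the function name lstm_step are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell under the standard (assumed) equations."""
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate value
    c = f * c_prev + i * g  # keep part of the old value, admit part of the new
    h = o * math.tanh(c)    # expose part of the cell value as the output
    return h, c


# Hypothetical parameters: the large forget-gate bias keeps the stored value,
# and the strongly negative input-gate bias blocks new input.
params = {
    "w_i": 0.5, "u_i": 0.5, "b_i": -20.0,
    "w_f": 0.5, "u_f": 0.5, "b_f": 20.0,
    "w_o": 0.5, "u_o": 0.5, "b_o": 0.0,
    "w_g": 0.5, "u_g": 0.5, "b_g": 0.0,
}

h, c = 0.0, 0.7  # the cell starts out remembering the value 0.7
for x in [1.0, -1.0, 0.5, 0.0, 2.0]:
    h, c = lstm_step(x, h, c, params)
```

With these hypothetical parameters, the cell value c stays very close to its starting value across the five inputs, which illustrates how a single LSTM can remember a value over a period of time.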

Another type of neural network is an auto-encoder. An auto-encoder derives, in an unsupervised manner, an efficient data coding. In certain embodiments, an auto-encoder learns a representation for a set of data for dimensionality reduction. For example, an auto-encoder learns to compress data from the input layer into a short code, and then uncompresses that code into something that substantially matches the original data.
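
By way of illustration only, compressing input data into a short code and uncompressing that code can be sketched as follows. The sketch assumes a linear auto-encoder with a one-value code trained by stochastic gradient descent on squared reconstruction error, which is a simplification and not necessarily the auto-encoder of FIGS. 5A through 5C. The data values, the learning rate, and the function names train_autoencoder, encode, decode, and mean_squared_error are hypothetical.

```python
def train_autoencoder(data, epochs=200, lr=0.01):
    """Train a linear auto-encoder that maps each input to a one-value code."""
    dim = len(data[0])
    enc = [0.1] * dim  # encoder weights: input -> code
    dec = [0.1] * dim  # decoder weights: code -> reconstruction
    for _ in range(epochs):
        for x in data:
            code = sum(w * xi for w, xi in zip(enc, x))
            err = [d * code - xi for d, xi in zip(dec, x)]
            # Gradient of the squared error with respect to the code,
            # computed before the decoder weights are updated.
            grad_code = 2 * sum(e * d for e, d in zip(err, dec))
            dec = [d - lr * 2 * e * code for d, e in zip(dec, err)]
            enc = [w - lr * grad_code * xi for w, xi in zip(enc, x)]
    return enc, dec


def encode(enc, x):
    return sum(w * xi for w, xi in zip(enc, x))


def decode(dec, code):
    return [d * code for d in dec]


def mean_squared_error(enc, dec, data):
    total = 0.0
    for x in data:
        recon = decode(dec, encode(enc, x))
        total += sum((r - xi) ** 2 for r, xi in zip(recon, x))
    return total / len(data)


# Hypothetical two-value observable features that vary along one direction,
# so a single latent value is enough to describe each of them.
data = [(1.0, 2.0), (2.0, 4.0), (-1.0, -2.0), (0.5, 1.0), (-1.5, -3.0)]

error_before = mean_squared_error([0.1, 0.1], [0.1, 0.1], data)
enc, dec = train_autoencoder(data)
error_after = mean_squared_error(enc, dec, data)
latent_code = encode(enc, (1.0, 2.0))
```

In this sketch, the value returned by encode plays the role of a latent feature: it is not observed directly, but it is sufficient to reconstruct the observable input.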

Neural networks can be adaptable such that a neural network can modify itself as the neural network learns and performs subsequent tasks. For example, initially a neural network can be trained. Training involves providing specific input to the neural network and instructing the neural network what output is expected. For example, a neural network can be trained to identify when a user interface object is to be modified. For example, a neural network can receive initial inputs (such as data from observable features). Providing the initial answers allows a neural network to adjust how the neural network internally weighs a particular decision to perform a given task. The neural network is then able to derive latent features from the observable features. In certain embodiments, the neural network can then receive feedback data that allows the neural network to continually improve various decision making and weighing processes, in order to remove false positives and increase the accuracy and efficiency of each decision.

FIG. 3 illustrates an electronic device 300 in accordance with an embodiment of this disclosure. The embodiment of the electronic device 300 shown in FIG. 3 is for illustration only and other embodiments could be used without departing from the scope of this disclosure. The electronic device 300 can come in a wide variety of configurations, and FIG. 3 does not limit the scope of this disclosure to any particular implementation of an electronic device. In certain embodiments, one or more of the devices 104-114 of FIG. 1 can include the same or similar configuration as electronic device 300.

In certain embodiments, the electronic device 300 is useable with data transfer applications, such as providing information to and receiving information from a neural network. In certain embodiments, the electronic device 300 is useable with user interface applications that can modify a user interface based on state data of the electronic device 300 and parameters of a neural network. The electronic device 300 can be a mobile communication device, such as, for example, a mobile station, a subscriber station, a wireless terminal, a desktop computer (similar to desktop computer 106 of FIG. 1), a portable electronic device (similar to the mobile device 108 of FIG. 1, the PDA 110 of FIG. 1, the laptop computer 112 of FIG. 1, and the tablet computer 114 of FIG. 1), and the like.

As shown in FIG. 3, the electronic device 300 includes an antenna 305, a communication unit 310, transmit (TX) processing circuitry 315, a microphone 320, and receive (RX) processing circuitry 325. The communication unit 310 can include, for example, an RF transceiver, a BLUETOOTH transceiver, a WI-FI transceiver, ZIGBEE, infrared, and the like. The electronic device 300 also includes a speaker 330, a processor 340, an input/output (I/O) interface (IF) 345, an input 350, a display 355, a memory 360, and a sensor(s) 365. The memory 360 includes an operating system (OS) 361, one or more applications 362, and observable features 363.

The communication unit 310 receives, from the antenna 305, an incoming RF signal, such as a BLUETOOTH or WI-FI signal, transmitted from an access point (such as a base station, WI-FI router, or BLUETOOTH device) of the network 102 (such as a WI-FI, BLUETOOTH, cellular, 5G, LTE, LTE-A, WiMAX, or any other type of wireless network). The communication unit 310 down-converts the incoming RF signal to generate an intermediate frequency or baseband signal. The intermediate frequency or baseband signal is sent to the RX processing circuitry 325 that generates a processed baseband signal by filtering, decoding, or digitizing the baseband or intermediate frequency signal, or a combination thereof. The RX processing circuitry 325 transmits the processed baseband signal to the speaker 330 (such as for voice data) or to the processor 340 for further processing (such as for web browsing data).

The TX processing circuitry 315 receives analog or digital voice data from the microphone 320 or other outgoing baseband data from the processor 340. The outgoing baseband data can include web data, e-mail, or interactive video game data. The TX processing circuitry 315 encodes, multiplexes, digitizes, or a combination thereof, the outgoing baseband data to generate a processed baseband or intermediate frequency signal. The communication unit 310 receives the outgoing processed baseband or intermediate frequency signal from the TX processing circuitry 315 and up-converts the baseband or intermediate frequency signal to an RF signal that is transmitted via the antenna 305.

The processor 340 can include one or more processors or other processing devices and execute the OS 361 stored in the memory 360 in order to control the overall operation of the electronic device 300. For example, the processor 340 could control the reception of forward channel signals and the transmission of reverse channel signals by the communication unit 310, the RX processing circuitry 325, and the TX processing circuitry 315 in accordance with well-known principles.

The processor 340 can execute instructions that are stored in a memory 360. The processor 340 can include any suitable number(s) and type(s) of processors or other devices in any suitable arrangement. For example, in some embodiments, the processor 340 includes at least one microprocessor or microcontroller. Example types of processor 340 include microprocessors, microcontrollers, digital signal processors, field programmable gate arrays, application specific integrated circuits, and discrete circuitry.

The processor 340 is also capable of executing other processes and programs resident in the memory 360, such as operations that receive, store, and timely instruct by providing ASR processing and the like. The processor 340 can move data into or out of the memory 360 as required by an executing process. In some embodiments, the processor 340 is configured to execute a plurality of applications 362 based on the OS 361 or in response to signals received from eNBs or an operator. Example applications 362 include a camera application (for still images and videos), a video phone call application, an email client, a social media client, an SMS messaging client, a virtual assistant, and the like. In certain embodiments, the processor 340 is configured to receive, acquire, and derive the observable features 363. The processor 340 is also coupled to the I/O interface 345 that provides the electronic device 300 with the ability to connect to other devices, such as client devices 104-116. The I/O interface 345 is the communication path between these accessories and the processor 340.

The processor 340 is also coupled to the input 350 and the display 355. The operator of the electronic device 300 can use the input 350 to enter data or inputs into the electronic device 300. Input 350 can be a keyboard, touch screen, mouse, track ball, voice input, or other device capable of acting as a user interface to allow a user to interact with electronic device 300. For example, the input 350 can include voice recognition processing, thereby allowing a user to input a voice command. For another example, the input 350 can include a touch panel, a (digital) pen sensor, a key, or an ultrasonic input device. The touch panel can recognize, for example, a touch input in at least one scheme among a capacitive scheme, a pressure sensitive scheme, an infrared scheme, or an ultrasonic scheme. Input 350 can be associated with sensor(s) 365 and/or a camera by providing additional input to processor 340. In certain embodiments, sensor 365 includes inertial measurement units (IMU) (such as accelerometers, gyroscopes, and magnetometers), motion sensors, optical sensors, cameras, pressure sensors, heart rate sensors, altimeters, and the like. The input 350 can also include a control circuit. In the capacitive scheme, the input 350 can recognize touch or proximity.

The display 355 can be a liquid crystal display (LCD), light-emitting diode (LED) display, organic LED (OLED), active matrix OLED (AMOLED), or other display capable of rendering text and/or graphics, such as from websites, videos, games, images, and the like.

The memory 360 is coupled to the processor 340. Part of the memory 360 could include a random access memory (RAM), and another part of the memory 360 could include a Flash memory or other read-only memory (ROM).

The memory 360 can include persistent storage (not shown) that represents any structure(s) capable of storing and facilitating retrieval of information (such as data, program code, and/or other suitable information on a temporary or permanent basis). The memory 360 can contain one or more components or devices supporting longer-term storage of data, such as a read only memory, hard drive, flash memory, or optical disc. The memory 360 also can contain observable features 363 that are received or derived from classic features as well as augmented features. Classic features include information derived or acquired from the user profile, such as the age of the user, the location of the user, the education level of the user, the gender of the user, and the like. Augmented features are acquired or derived from various other services or sources. For example, augmented features can include information generated by the presence of a user on social media, emails and SMS messages that are transmitted to and from the user, the online footprint of the user, usage logs of utterances (both verbal and electronically inputted, such as typed), and the like.

An online footprint is the trail of data generated by the user while the user accesses the Internet. For example, an online footprint of a user represents traceable digital activities, actions, contributions, and communications that are manifested on the Internet. An online footprint can include websites visited, internet search history, emails sent, and information submitted to various online services. For example, when a person visits a particular website, the website can save the IP address that identifies the person's internet service provider and the approximate location of the person. An online footprint can also include a review the user provided for a product, service, restaurant, retail establishment, and the like. An online footprint of a user can also include blog postings and social media postings.

Electronic device 300 further includes one or more sensor(s) 365 that can meter a physical quantity or detect an activation state of the electronic device 300 and convert metered or detected information into an electrical signal. For example, sensor 365 can include one or more buttons for touch input, a camera, a gesture sensor, IMU sensors (such as a gyroscope or gyro sensor and an accelerometer), an air pressure sensor, a magnetic sensor or magnetometer, a grip sensor, a proximity sensor, a color sensor, a bio-physical sensor, a temperature/humidity sensor, an illumination sensor, an Ultraviolet (UV) sensor, an Electromyography (EMG) sensor, an Electroencephalogram (EEG) sensor, an Electrocardiogram (ECG) sensor, an IR sensor, an ultrasound sensor, an iris sensor, a fingerprint sensor, and the like. The sensor 365 can further include a control circuit for controlling at least one of the sensors included therein. Any of these sensor(s) 365 can be located within the electronic device 300.

Although FIGS. 2 and 3 illustrate examples of devices in a computing system, various changes can be made to FIGS. 2 and 3. For example, various components in FIGS. 2 and 3 could be combined, further subdivided, or omitted and additional components could be added according to particular needs. As a particular example, the processor 340 could be divided into multiple processors, such as one or more central processing units (CPUs) and one or more graphics processing units (GPUs). In addition, as with computing and communication networks, electronic devices and servers can come in a wide variety of configurations, and FIGS. 2 and 3 do not limit this disclosure to any particular electronic device or server.

FIGS. 4A and 4B illustrate an example ASR system 400 in accordance with an embodiment of this disclosure. FIGS. 4A and 4B illustrate a high-level architecture, in accordance with an embodiment of this disclosure. FIG. 4B is a continuation of FIG. 4A. The embodiment of the ASR system 400 shown in FIGS. 4A and 4B is for illustration only. Other embodiments can be used without departing from the scope of the present disclosure.

The ASR system 400 includes various components. In certain embodiments, some of the components included in the ASR system 400 can be included in a single device, such as the mobile device 108 of FIG. 1 that includes internal components similar to the electronic device 300 of FIG. 3. In certain embodiments, a portion of the components included in the ASR system 400 can be included in two or more devices, such as the server 104 of FIG. 1, which can include internal components similar to the server 200 of FIG. 2, and the mobile device 108, which can include internal components similar to the electronic device 300 of FIG. 3. The ASR system 400 includes a received verbal utterance 402, feature extraction 404, an acoustic model 406, a general language model 408, a pronunciation model 410, a decoder 412, a domain classifier 414, domain specific language models 416, a dialogue manager 418, device context 420, observable features 422, an auto-encoder 428, a contextualizing module 430, and a generated output 432 of a personalized language model.

The verbal utterance 402 is an audio signal that is received by an electronic device, such as the mobile device 108 of FIG. 1. In certain embodiments, the verbal utterance 402 can be created by a person speaking to the electronic device 300, and a microphone (such as the microphone 320 of FIG. 3) converts the sound waves into electronic signals that the mobile device 108 can process. In certain embodiments, the verbal utterance 402 can be created by another electronic device, such as an artificial intelligent electronic device, sending an electronic signal or generating noise through a speaker that is received by mobile device 108.

The feature extraction 404 preprocesses the verbal utterance 402. In certain embodiments, the feature extraction 404 performs noise cancelation with respect to the received verbal utterance 402. The feature extraction 404 can also perform echo cancelation with respect to the received verbal utterance 402. The feature extraction 404 can also extract features from the received verbal utterance 402. For example, using a Fourier Transform, the feature extraction 404 extracts various features from the verbal utterance 402. In another example, using Mel-Frequency Cepstral Coefficients (MFCC), the feature extraction 404 extracts various features from the verbal utterance 402. Since audio is susceptible to noise, the feature extraction 404 extracts specific frequency components from the verbal utterance 402. For example, a Fourier Transform transforms a time domain signal to the frequency domain in order to generate the frequency coefficients.
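As a minimal, non-limiting sketch of the time domain to frequency domain transformation described above, the following computes the magnitude of each discrete Fourier transform coefficient for a short frame of samples. The function name `dft_magnitudes` and the eight-sample test frame are assumptions made for illustration only; a practical feature extraction stage would typically use a fast Fourier transform and further steps (such as MFCC computation) that are not shown here.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Return the magnitude of each DFT coefficient for a list of real samples."""
    n = len(frame)
    magnitudes = []
    for k in range(n):
        # Correlate the frame with a complex sinusoid at frequency bin k.
        total = sum(
            frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        magnitudes.append(abs(total))
    return magnitudes


# Hypothetical frame: 8 samples of a sinusoid that completes exactly 2 cycles.
frame = [math.sin(2 * math.pi * 2 * t / 8) for t in range(8)]
spectrum = dft_magnitudes(frame)

# Only bins 0..n/2 are distinct for a real-valued signal.
peak_bin = max(range(len(spectrum) // 2 + 1), key=lambda k: spectrum[k])
```

In this sketch the energy of the frame concentrates in the bin that matches the number of cycles in the frame, which is the sense in which the transform exposes specific frequency components.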

The acoustic model 406 generates probabilistic models of relationships between acoustic features and phonetic units, such as phonemes and other linguistic units that comprise speech. The acoustic model 406 provides the decoder 412 with the probabilistic relationships between the acoustic features and the phonemes. In certain embodiments, the acoustic model 406 can receive the MFCC features that are generated in the feature extraction 404 and then classify each frame as a particular phoneme. Each frame is a small portion of the received verbal utterance 402, based on time. For example, a frame is a predetermined time duration of the received verbal utterance 402. A phoneme is a unit of sound. For example, the acoustic model 406 can convert the received verbal utterance 402, such as “SHE,” into the phonemes “SH” and “IY.” In another example, the acoustic model 406 can convert the received verbal utterance 402, such as “HAD,” into the phonemes “HH,” “AA,” and “D.” In another example, the acoustic model 406 can convert the received verbal utterance 402, such as “ALL,” into the phonemes “AO” and “L.”
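The notion of a frame as a predetermined time duration of the utterance can be sketched as follows. The function name `split_into_frames`, the non-overlapping layout, and the example sample rate and frame duration are assumptions for illustration; this disclosure does not specify them, and practical systems often use overlapping windows.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms):
    """Split a list of samples into consecutive, non-overlapping frames."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame duration is shorter than one sample")
    # Drop a trailing partial frame so every frame has the same duration.
    usable = len(samples) - len(samples) % frame_len
    return [samples[i:i + frame_len] for i in range(0, usable, frame_len)]


# Hypothetical input: 10 samples at 1000 Hz, cut into 4 ms frames.
frames = split_into_frames(list(range(10)), sample_rate_hz=1000, frame_ms=4)
```

Each resulting frame could then be passed to a classifier that assigns it a phoneme label.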

The general language model 408 models word sequences. For example, the general language model 408 determines the probability of a sequence of words. The general language model 408 provides the probability of which word sequences are more likely than other word sequences. For example, the general language model 408 provides the decoder 412 with various probability distributions that are associated with a given sequence of words. The general language model 408 identifies the likelihood of different phrases. For example, based on context, the general language model 408 can distinguish between words and phrases that sound similar.
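One common way to assign a probability to a sequence of words is an n-gram model. The following is a minimal sketch of a bigram model with add-one smoothing; it is offered only as an illustration of the idea and is not stated in this disclosure to be the form of the general language model 408. The training sentences and function names are hypothetical.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count unigrams and bigrams over sentences padded with a start token."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def sequence_probability(words, unigrams, bigrams):
    """Probability of a word sequence as a product of smoothed bigram terms."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + words.lower().split()
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen word pairs from scoring zero.
        prob *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return prob


# Hypothetical training text.
corpus = ["recognize speech", "recognize speech today", "wreck a nice beach"]
unigrams, bigrams = train_bigram_model(corpus)
p_likely = sequence_probability("recognize speech", unigrams, bigrams)
p_unlikely = sequence_probability("wreck speech", unigrams, bigrams)
```

Under these counts the sequence seen in the training text receives a higher probability than the unseen combination, which is how such a model can prefer one of two similar-sounding phrases.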

The pronunciation model 410 maps words to phonemes. The mapping of words to phonemes can be statistical. The pronunciation model 410 converts phonemes into words that are understood by the decoder 412. For example, the pronunciation model 410 converts the phonemes “HH,” “AA,” and “D” into “HAD.”
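The conversion of phonemes into words can be sketched with a small look-up table. The lexicon below contains only the three example words given above, and the greedy longest-match segmentation is an assumption chosen for brevity; it is a deterministic simplification and does not capture the statistical mapping mentioned above.

```python
# Hypothetical lexicon built from the three example words in the text.
LEXICON = {
    "SHE": ("SH", "IY"),
    "HAD": ("HH", "AA", "D"),
    "ALL": ("AO", "L"),
}

# Invert the word-to-phoneme mapping so phoneme sequences can be looked up.
PHONEMES_TO_WORD = {phonemes: word for word, phonemes in LEXICON.items()}


def phonemes_to_words(phonemes):
    """Greedily segment a phoneme list into lexicon words, longest match first."""
    words, i = [], 0
    max_len = max(len(p) for p in PHONEMES_TO_WORD)
    while i < len(phonemes):
        for length in range(min(max_len, len(phonemes) - i), 0, -1):
            candidate = tuple(phonemes[i:i + length])
            if candidate in PHONEMES_TO_WORD:
                words.append(PHONEMES_TO_WORD[candidate])
                i += length
                break
        else:
            # No lexicon entry begins at this position.
            raise ValueError(f"no lexicon entry starting at phoneme index {i}")
    return words
```

For instance, `phonemes_to_words(["HH", "AA", "D"])` yields a list containing the single word "HAD".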

The decoder 412 receives (i) the probabilistic models of relationships between acoustic features and phonetic units from the acoustic model 406, (ii) the probability associated with a particular sequence of words from the general language model 408, and (iii) the converted phonemes, from the pronunciation model 410, that can be understood by the decoder 412. The decoder 412 searches for the best word sequence based on a given acoustic signal.

The outcome of the decoder 412 is limited based on the probability rating of a sequence of words as determined by the general language model 408. For example, the general language model 408 can represent one or more language models that are trained to understand vernacular speech patterns of a portion of the population. For example, the general language model 408 is not based on a particular person, rather it is based on a large grouping of persons that have different ages, genders, locations, interests, and the like.

Embodiments of the present disclosure take into consideration that, to increase the accuracy of the ASR system 400, the language model is tailored to the user who is speaking or who created the verbal utterance 402, rather than to a general person or group of persons. Based on the context, certain utterances are more likely than other utterances. For example, each person, when speaking, uses a slightly different sequence of words. The differences can be based on the individual's age, gender, geographic location, interests, and speaking habits. Therefore, creating a language model that is unique to each person can improve the overall outcome of the ASR system 400. In order to create a language model, written examples and verbal examples are needed. When a user enrolls in a new ASR system, very little information is known about the user. Certain information can be learned based on a profile of the user, such as the age, gender, location, and the like. Generating a language model that is tailored to a specific person involves identifying and then comparing latent features of the user to multiple language models. Based on the level of similarity between the latent features of the specific person and the various language models, a personalized language model can be created for the particular user.

The decoder 412 derives a series of words based on a particular phoneme sequence that corresponds to the highest probability. In certain embodiments, the decoder 412 can create a single output or a number of likely sequences. If the decoder 412 outputs a number of likely sequences, the decoder 412 can also create a probability that is associated with each sequence. A language model that is personalized to the speaker of the verbal utterance 402 can increase the accuracy of the series of words determined by the decoder 412.

To create a language model for a particular person, the observable features 422 are gathered for a particular person. Additionally, a personalized language model is based on various contexts associated with the verbal utterance of the user. For example, the decoder 412 can also provide information to the domain classifier 414. Further, a personalized language model can be based on the type of device that receives the verbal utterance, identified by the device context 420.

The domain classifier 414 is a classifier which identifies various language or audio features from the verbal utterance to determine the target domain for the verbal utterance. For example, the domain classifier 414 can identify the domain context, such as the topic associated with the verbal utterance. If the domain classifier 414 identifies that the domain context is music, then the contextualizing module 430 will be able to determine that the next sequence of words will most likely be associated with music, such as an artist's name, an album's name, a song title, lyrics to a song, and the like. If the domain classifier 414 identifies that the domain context is movies, then the contextualizing module 430 will be able to determine that the next sequence of words will most likely be associated with movies, such as actors, genres, directors, movie titles, and the like. If the domain classifier 414 identifies that the domain context is sports, then the contextualizing module 430 will be able to determine that the next sequence of words will most likely be associated with sports, such as a type of sport (football, soccer, hockey, basketball, and the like), as well as athletes and commentators, to name a few. In certain embodiments, the domain classifier 414 is external to the ASR system 400.

The domain classifier 414 can output data into the domain specific language models 416 and the dialogue manager 418. Language models within the domain specific language models 416 include language models that are trained using specific utterances from within a particular domain context. For example, the domain specific language models 416 include language models that are associated with specific domain contexts, such as music, movies, sports, and the like.

The dialogue manager 418 identifies the states of dialogue between the user and the device. For example, the dialogue manager 418 can capture the current action that is being executed to identify which parameters have been received and which are remaining. In certain embodiments, the dialogue manager 418 can also derive the grammar associated with the verbal utterance. For example, the dialogue manager 418 can derive the grammar that is associated with each state in order to describe the expected utterance. For example, if the ASR system 400 prompts the user for a date, the dialogue manager 418 provides a high probability that the verbal utterance that is received from the user will be a date. In certain embodiments, grammar that is derived from the dialogue manager 418 is not converted to a language model, as the contextualizing module 430 uses an indicator of a match of the verbal utterance with the derived language output.
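The use of a match indicator in place of a full language model can be sketched as follows. The state names, the regular-expression grammars, and the function `grammar_match` are all hypothetical and are chosen only to illustrate a per-state grammar that describes the expected utterance, such as a date.

```python
import re

# Hypothetical per-state grammars, written here as regular expressions.
STATE_GRAMMARS = {
    "awaiting_date": re.compile(
        r"^(january|february|march|april|may|june|july|august|september|"
        r"october|november|december) \d{1,2}$"
    ),
    "awaiting_confirmation": re.compile(r"^(yes|no)$"),
}


def grammar_match(state, hypothesis):
    """Return 1 if the hypothesis fits the grammar of the dialogue state, else 0."""
    grammar = STATE_GRAMMARS.get(state)
    return int(bool(grammar and grammar.match(hypothesis.lower())))
```

An indicator of this kind could be used by a rescoring stage to favor hypotheses that fit the current dialogue state, for example a date when the system has just prompted for one.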

The device context 420 identifies the type of device that receives the verbal utterance. For example, a personalized language model can be based on the type of device that receives the verbal utterance. Example devices include a mobile phone, a TV, an appliance such as an oven or a refrigerator, and the like. For example, the verbal utterance of “TURN IT UP” when spoken to the TV can indicate that the user wants the volume louder, whereas when spoken to the oven, it can indicate that the temperature is to be higher.

The observable features 422 include classic features 424 and augmented features 426. The classic features 424 can include biographical information about the individual (such as age, gender, location, hometown, and the like). The augmented features 426 can include features that are acquired about the user, such as SMS text messages, social media posts, written reviews, written blogs, logs, environment, context, and the like. The augmented features 426 can be derived from the online footprint of the user. The augmented features 426 can also include derived interests of a user, such as hobbies of the particular person. In certain contexts (such as sports, football, fishing, cooking, ballet, gaming, music, motor boats, sailing, opera, and the like), various word sequences can appear more than others based on each particular hobby or interest. Analyzing logs of the user enables the language model to derive trends of what the particular person has spoken or written in the past, which provides an indication as to what the user will possibly speak in the future. The environment identifies where the user is currently. Persons that are in certain locations often speak with particular accents, or use certain words when speaking. For example, regional differences can cause different pronunciations and dialects. For example, “YA'LL” as compared to “YOU GUYS,” “POP” as compared to “SODA,” and the like. Context can include the subject matter associated with the verbal utterance 402 as well as to whom the verbal utterance 402 is directed. For example, the context of the verbal utterance can change if the verbal utterance 402 is directed to an automated system over a phone line, or to an appliance.

The observable features 422 can be gathered and represented as a single multi-dimensional vector. Each dimension of the vector could represent a meaningful characteristic related to the user. For example, a single vector can indicate a gender of the user, a location of the user, and interests of the user, based on the observable features 422. The vector that represents the observable features 422 can encompass many dimensions due to the vast quantities of information included in the observable features 422 that are associated with a single user. Latent features are derived from the observable features 422 via the auto-encoder 428. The latent features are latent contextual features that are based on hidden similarities between users. The derived latent features provide connections between two or more of the observable features 422. For example, a latent feature, derived by the auto-encoder 428, can correspond to a single dimension of the multi-dimensional vector. The single dimension of the multi-dimensional vector can correspond to multiple aspects of a person's personality. Latent features are learned by training an auto-encoder, such as the auto-encoder 428, on observable features, similar to the observable features 422.
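One way to picture observable features gathered into a single multi-dimensional vector is the following sketch. The vocabularies, the scaling of age, and the function `encode_observable_features` are hypothetical; the disclosure does not specify an encoding, and a practical vector would have many more dimensions.

```python
# Hypothetical vocabularies; a practical system would define far more dimensions.
LOCATIONS = ["north", "south", "east", "west"]
INTERESTS = ["sports", "music", "cooking", "sailing"]


def encode_observable_features(profile):
    """Flatten a user profile dictionary into one fixed-length list of floats."""
    vector = [profile["age"] / 100.0]  # age scaled to roughly 0..1
    # One-hot encoding of the location of the user.
    vector += [1.0 if profile["location"] == loc else 0.0 for loc in LOCATIONS]
    # Multi-hot encoding of the interests of the user.
    vector += [1.0 if tag in profile["interests"] else 0.0 for tag in INTERESTS]
    return vector


# Hypothetical user profile.
user = {"age": 30, "location": "south", "interests": {"music", "sailing"}}
vector = encode_observable_features(user)
```

A vector of this form is the kind of input from which an auto-encoder could derive lower-dimensional latent features.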

The auto-encoder 428 performs unsupervised learning based on observable features 422. The auto-encoder 428 is a neural network that performs unsupervised learning of efficient coding. The auto-encoder 428 is trained to compress data from an input layer into a short code, and then decompress that code into content that closely matches the original data. The short code represents the latent features. Compressing the input creates the latent features that are hidden within the observable features 422. The short code is compressed to a state such that the auto-encoder 428 can reconstruct the input. As a result, the input and the output of the auto-encoder 428 are substantially similar. The auto-encoder compresses the information, such that multiple pieces of information included in the observable features 422 are within a single vector. Compressing the information included in the observable features 422 into a lower dimension creates a meaningful representation that includes a hidden or latent meaning. The auto-encoder 428 is described in greater detail below with respect to FIGS. 5A, 5B, and 5C.
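The compress-then-reconstruct behavior described above can be illustrated with a deliberately small sketch: a linear auto-encoder with a single code unit, trained by full-batch gradient descent on three-dimensional inputs that all lie along one direction. The architecture, learning rate, initialization, and data are assumptions made for illustration and are not the structure of the auto-encoder 428, which is described with respect to FIGS. 5A, 5B, and 5C.

```python
def train_linear_autoencoder(data, init=0.1, lr=0.1, epochs=500):
    """Train a one-unit linear auto-encoder with full-batch gradient descent."""
    dim = len(data[0])
    enc = [init] * dim  # encoder weights: input -> short code
    dec = [init] * dim  # decoder weights: short code -> reconstruction
    losses = []
    for _ in range(epochs):
        grad_enc, grad_dec, loss = [0.0] * dim, [0.0] * dim, 0.0
        for x in data:
            z = sum(w * xi for w, xi in zip(enc, x))       # compress to a code
            err = [d * z - xi for d, xi in zip(dec, x)]    # reconstruction error
            loss += sum(e * e for e in err) / len(data)
            dz = sum(2 * e * d for e, d in zip(err, dec))  # gradient w.r.t. code
            for i in range(dim):
                grad_dec[i] += 2 * err[i] * z / len(data)
                grad_enc[i] += dz * x[i] / len(data)
        losses.append(loss)
        enc = [w - lr * g for w, g in zip(enc, grad_enc)]
        dec = [w - lr * g for w, g in zip(dec, grad_dec)]
    return enc, dec, losses


# Hypothetical inputs: 3-dimensional vectors that vary along a single direction.
direction = [1.0, 0.5, -0.5]
data = [[s * d for d in direction] for s in (-1.0, -0.5, 0.5, 1.0)]
enc, dec, losses = train_linear_autoencoder(data)

# The one-dimensional code for each input plays the role of a latent feature.
latent_codes = [sum(w * xi for w, xi in zip(enc, x)) for x in data]
```

Because the inputs in this sketch have only one underlying degree of freedom, a single code value is enough to reconstruct all three input dimensions, and the reconstruction loss falls as training proceeds.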

The contextualizing module 430 selects the top-k hypotheses from the decoder 412, and rescores the values. The values are rescored based on the domain specific language (as identified by the domain specific language models 416), the grammars for the current dialog state (from the dialogue manager 418), the personalized language model (as derived via the observable features 422 and the auto-encoder 428), and the device context 420. The contextualizing module 430 rescores the probabilities that are associated with each sequence as identified by the decoder 412.

The contextualizing module 430 rescores the probabilities from the decoder 412. For example, the contextualizing module 430 rescores the probabilities based on Equation 1, below:

$\begin{matrix}{{P_{LM}\left( W \middle| C \right)} \propto \frac{\prod\limits_{i = 1}^{k}{P_{LM}\left( W \middle| S_{i} \right)}}{{P_{LM}(W)}^{k - 1}}} & (1)\end{matrix}$

Equation 1 describes the probability of a word sequence given subsets S_(i) of various context elements. The element ‘W’ is the sentence hypothesis [W₀ . . . W_(k)], with W_(i) being the i^(th) word in the sequence. The context is the expression C={C_(i)}_(i=1, . . . , N), where each C_(i) ∈ {domain, state, user profile, usage logs, environment, device, and the like} and each S_(i) ⊂ C is a subset of C containing mutually dependent elements. The subsets S_(i) and S_(j) are mutually independent ∀j≠i. For example, the expression S₁={location, weather} and the expression S₂={age, gender}. As a result, the expression P_(LM)(W|S₁) represents the probability of a word sequence in the language model created from S₁, that is, from the location of the user and the weather.

If all the context elements are mutually independent, the contextualizing module 430 rescores the probabilities based on Equation 2, below:

$\begin{matrix}{{P_{LM}\left( W \middle| C \right)} \propto \frac{\prod\limits_{i = 1}^{k}{P_{LM}\left( W \middle| C_{i} \right)}}{{P_{LM}(W)}^{k - 1}}} & (2)\end{matrix}$

In Equation 2 above, P_(LM)(W|C_(i)) represents the probability of a word sequence in the language model that is specific to the context C_(i). For example, the expression P_(LM)(W|Domain) is the probability of a word sequence in the domain specific language model. The expression P_(LM)(W|State) is the probability of a word sequence in the grammar of this state. Similarly, the expression P_(LM)(W|User Profile) is the probability of a word sequence given the profile of the user. The expression P_(LM)(W|User Logs) is the probability of a word sequence in the language model created from the usage logs of the user. The expression P_(LM)(W|Environment) is the probability of a word sequence in the language model for the current environment of the user. The expression P_(LM)(W|Device) is the probability of a word sequence in the language model for the current device that the user is speaking to.
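The rescoring of Equation 2 can be sketched in log space, where the product over context elements becomes a sum and the division by P_(LM)(W)^(k−1) becomes a subtraction. The function names, the dictionary layout, the hypothesis text, and every probability value below are hypothetical and serve only to show the arithmetic; because Equation 2 is a proportionality, the resulting scores are unnormalized and are meaningful only for ranking hypotheses against one another.

```python
import math


def rescore_log_prob(context_probs, prior_prob):
    """Unnormalized log P_LM(W|C) following Equation 2.

    context_probs maps each context element C_i to P_LM(W|C_i);
    prior_prob is P_LM(W) from the general language model.
    """
    k = len(context_probs)
    log_numerator = sum(math.log(p) for p in context_probs.values())
    return log_numerator - (k - 1) * math.log(prior_prob)


def rescore_hypotheses(hypotheses):
    """Return hypothesis texts ordered from most to least likely."""
    scored = [
        (rescore_log_prob(h["context"], h["prior"]), h["text"])
        for h in hypotheses
    ]
    return [text for _, text in sorted(scored, reverse=True)]


# Hypothetical top-2 list from a decoder, with made-up probabilities.
hypotheses = [
    {"text": "play the bees", "prior": 0.020,
     "context": {"domain": 0.010, "device": 0.020}},
    {"text": "play the keys", "prior": 0.010,
     "context": {"domain": 0.050, "device": 0.020}},
]
ranked = rescore_hypotheses(hypotheses)
```

In this made-up example, the hypothesis that the general language model considers less likely is ranked first after rescoring, because the domain specific probability favors it.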

The output 432 from the contextualizing module 430 is the speech recognition based on the personal language model of the user who created the verbal utterance 402.

FIG. 4C illustrates a block diagram of an example environment architecture 450, in accordance with an embodiment of this disclosure. The embodiment of the environment architecture 450 is for illustration only. Other embodiments can be used without departing from the scope of the present disclosure.

Environment architecture 450 includes an electronic device 470 communicating with a server 480 over network 460. The electronic device 470 can be configured similar to any of the one or more client devices 106-116 of FIG. 1, and can include internal components similar to that of electronic device 300 of FIG. 3. The server 480 can be configured similar to the server 104 of FIG. 1, and can include internal components similar to that of server 200 of FIG. 2. The components, or a portion of the components, of the server 480 can be included in electronic device 470. A portion of the components of the electronic device 470 can be included in server 480. For example, the server 480 can generate the personalized language models as illustrated in FIG. 4C. Alternatively, the electronic device 470 can generate the personalized language models. For instance, either the electronic device 470 or the server 480 can include an auto-encoder (such as the auto-encoder 428 of FIG. 4B) to identify the latent features from a set of observable features 422 that are associated with the user of the electronic device 470. After the latent features are identified, the electronic device 470 or the server 480 can create the personalized language models of the particular user. The electronic device 470 can also adaptively use language models provided by the server 480 in order to create personalized language models that are particular to the user of the electronic device 470.

The network 460 is similar to the network 102 of FIG. 1. In certain embodiments, the network 460 represents a “cloud” of computers interconnected by one or more networks, where the network is a computing system utilizing clustered computers and components to act as a single pool of seamless resources when accessed. In certain embodiments, the network 460 is connected with one or more neural networks (such as the auto-encoder 428 of FIG. 4B), one or more servers (such as the server 104 of FIG. 1), and one or more electronic devices (such as any of the client devices 106-116 of FIG. 1 and the electronic device 470). In certain embodiments, the network can be connected to an information repository, such as a database, that contains look-up tables and information pertaining to various language models and ASR systems, similar to the ASR system 400 of FIGS. 4A and 4B.

The electronic device 470 is an electronic device that can receive a verbal utterance, such as the verbal utterance 402 of FIG. 4A, and perform a function based on the received verbal utterance. In certain embodiments, the electronic device 470 is a smart phone, similar to the mobile device 108 of FIG. 1. For example, the electronic device 470 can receive a verbal input and, through an ASR system similar to the ASR system 400 of FIGS. 4A and 4B, derive meaning from the verbal input and perform a particular function. The electronic device 470 includes a receiver 472, an information repository 474, and a natural language processor 476.

The receiver 472 is similar to the microphone 320 of FIG. 3. The receiver 472 receives sound waves, such as voice data, and converts the sound waves into an electrical signal. The voice data received from the receiver 472 can be associated with the natural language processor 476, which interprets one or more verbal utterances spoken by a user. The receiver 472 can be a microphone similar to a dynamic microphone, a condenser microphone, a piezoelectric microphone, or the like. The receiver 472 can also receive verbal utterances from another electronic device. For example, the other electronic device can include a speaker, similar to the speaker 330 of FIG. 3, which creates verbal utterances. In another example, the receiver 472 can receive wired or wireless signals representing verbal utterances.

The information repository 474 can be similar to memory 360 of FIG. 3. The information repository 474 represents any structure(s) capable of storing and facilitating retrieval of information (such as data, program code, or other suitable information on a temporary or permanent basis). The information repository 474 can include a memory and a persistent storage. Memory can be RAM or any other suitable volatile or non-volatile storage device(s), while persistent storage can contain one or more components or devices supporting longer-term storage of data, such as a ROM, hard drive, Flash memory, or optical disc.

In certain embodiments, the information repository 474 includes the observable features 422 of FIG. 4B and the observable features 363 of FIG. 3. Information and content that is maintained in the information repository 474 can include the observable features 422 and a personalized language model associated with a user of the electronic device 470. The observable features 422 can be maintained in a log and updated at predetermined intervals. If the electronic device 470 includes multiple users, then the observable features 422 associated with each user, as well as the personalized language model that is associated with each user, can be included in the information repository 474. In certain embodiments, the information repository 474 can include the latent features derived via an auto-encoder (such as the auto-encoder 428 of FIG. 4B) based on the observable features 422.

The natural language processor 476 is similar to the ASR system 400 or a portion of the ASR system 400 of FIGS. 4A and 4B. The natural language processor 476 allows a user to interact with the electronic device 470 through sound, such as voice and speech, as detected by the receiver 472. The natural language processor 476 can include one or more processors for converting a user's speech into executable instructions. The natural language processor 476 allows a user to interact with the electronic device 470 by talking to the device. For example, a user can speak a command and the natural language processor 476 can interpret the sound waves and perform the given command, such as through the decoder 412 of FIG. 4A and the contextualizing module 430 of FIG. 4B. In certain embodiments, the natural language processor 476 utilizes voice recognition, such as voice biometrics, to identify the user based on a voice pattern of the user, in order to reduce, filter, or eliminate commands not originating from the user. When multiple users are associated with the same electronic device, such as the electronic device 470, voice biometrics can select a particular language model for the individual who spoke the verbal utterance. The natural language processor 476 can utilize a personalized language model to identify, from the received verbal utterances, the sequence of words having the higher probability. In certain embodiments, the natural language processor 476 can generate personalized language models based on previously created language models.

The personalized language models are language models that are based on the individual speakers. For example, the personalized language model is based on interests of the user, as well as biographical data such as age, location, gender, and the like. In certain embodiments, the electronic device 470 can derive the interests of the user via an auto-encoder (such as the auto-encoder 428 of FIG. 4B and the auto-encoder 500 of FIG. 5A). An auto-encoder can derive latent features based on the observable features (such as the observable features 422) that are stored in the information repository 474. For speech recognition, the natural language processor 476 uses a personalized language model for the speaker or user who created the verbal utterance. The personalized language model can be created locally on the electronic device or remotely, such as through the personalized language model engine 484 of the server 480. For example, based on the derived latent features of the user, the personalized language model engine 484 generates a weighted language model specific to the interests and biographical information of the particular user. In certain embodiments, the observable features 422, the personalized language model, or a combination thereof are stored in an information repository that is external to the electronic device 470.

The server 480 can represent one or more local servers, one or more natural language processing servers, one or more speech recognition servers, one or more neural networks (such as an auto-encoder), or the like. The server 480 can be a web server, a server computer such as a management server, or any other electronic computing system capable of sending and receiving data. In certain embodiments, the server 480 is a “cloud” of computers interconnected by one or more networks, where the server 480 is a computing system utilizing clustered computers and components to act as a single pool of seamless resources when accessed through network 460. The server 480 can include a latent feature generator 482, a personalized language model engine 484, and an information repository.

The latent feature generator 482 is described in greater detail below with respect to FIGS. 5A, 5B, and 5C. In certain embodiments, the latent feature generator 482 is a component of the electronic device 470. The latent feature generator 482 can receive observable features, such as the observable features 422 from the electronic device 470. In certain embodiments, the latent feature generator 482 is a neural network. For example, the neural network can be an auto-encoder. The neural network uses unsupervised learning to encode the observable features 422 of a particular user into a latent representation of the observable features. In particular, the latent feature generator 482 identifies relationships between the observable features 422 of a user. For example, the latent feature generator 482 derives patterns between two or more of the observable features 422 associated with a user. In certain embodiments, the latent feature generator 482 compresses the input to a threshold level such that, when the input is reconstructed, the input and the reconstructed input are substantially the same. The compressed middle layer represents the latent features.

The personalized language model engine 484 is described in greater detail below with respect to FIGS. 6A, 6B, and 7. For each user, the personalized language model engine 484 sorts the latent features of that user into clusters. The personalized language model engine 484 builds an information repository, such as the information repository 486, that is associated with each cluster. Each information repository 486 can include verbal utterances from a number of different users that share the same cluster or share an overlapping cluster. A language model can be constructed for each information repository 486 that is associated with each cluster. That is, the language models are built around clusters that are defined in a space using latent features. The clusters can be mapped to a space that is defined by the latent features.

The number of clusters that the personalized language model engine 484 identifies can be predetermined. For example, the personalized language model engine 484 can be configured to derive a predetermined number of clusters. In certain embodiments, the quantity of clusters is data driven. For example, the quantity of latent features derived by the latent feature generator 482 can indicate to the personalized language model engine 484 the number of clusters. In another example, the number of identifiable groupings of text can indicate the number of clusters.

The personalized language model engine 484 then builds a personalized language model for each user based on that user's individual latent features and the text associated with each cluster. For example, a user can have latent features that overlap one or more clusters that are associated with a language model. The language models can be weighted and customized based on the magnitude of the latent features of the user. For example, if the clusters of the individual indicate an interest in sports and a location in New York City, N.Y., then the personalized language model engine 484 selects previously generated language models that are specific to those clusters, weights them according to the user's individual clusters, and generates a personalized language model for the user. The personalized language model can be stored on the electronic device of the user, such as in the information repository 474, or stored remotely in the information repository 486 accessed via the network 460.

The information repository 486 is similar to the information repository 474. Additionally, the information repository 486 can be similar to memory 230 of FIG. 2. The information repository 486 represents any structure(s) capable of storing and facilitating retrieval of information (such as data, program code, or other suitable information on a temporary or permanent basis). The information repository 486 can include a memory and a persistent storage. Memory can be RAM or any other suitable volatile or non-volatile storage device(s), while persistent storage can contain one or more components or devices supporting longer-term storage of data, such as a ROM, hard drive, Flash memory, or optical disc. In certain embodiments, the information repository 486 includes the databases of verbal utterances associated with one or more clusters. The information repository 486 can also include cluster specific language models. The cluster specific language models can be associated with a particular cluster, such as interests, age groups, geographic locations, genders, and the like. For example, a cluster specific language model can be a language model for persons from a particular area, of an age range, with similar political preferences, or with similar interests (such as sports, theater, TV shows, movies, music, among others, as well as sub-genres of each). The corpus of each database of verbal utterances associated with one or more clusters can be used to create, build, and train the various language models.

FIG. 5A illustrates an example auto-encoder 500 in accordance with an embodiment of this disclosure. FIGS. 5B and 5C illustrate different component aspects of the auto-encoder 500 in accordance with an embodiment of this disclosure. The embodiments of FIGS. 5A, 5B, and 5C are for illustration only. Other embodiments can be used without departing from the scope of the present disclosure.

The auto-encoder 500 is an unsupervised neural network. In certain embodiments, the auto-encoder 500 efficiently encodes high dimensional data. For example, the auto-encoder 500 compresses the high dimensional data to extract hidden or latent features. The auto-encoder 500 can be similar to the auto-encoder 428 of FIG. 4B and the latent feature generator 482 of FIG. 4C. The auto-encoder 500 includes an input 510, an output 520, and latent features 530.

The auto-encoder 500 compresses the input 510 until a bottleneck that yields the latent features 530 and then decompresses the latent features 530 into the output 520. The output 520 and the input 510 are substantially the same. The latent features 530 are the input 510 of observable features that are compressed to a threshold such that, when decompressed, the input 510 and the output 520 are substantially similar. If the compression of the input 510 is increased, then when the latent features are decompressed, the output 520 and the input 510 are not substantially similar due to the deterioration of the data from the compression. In certain embodiments, the auto-encoder 500 is a neural network that is trained to generate the latent features 530 from the input 510.
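The compress-then-decompress behavior described above can be illustrated with a minimal sketch. The sketch assumes a linear, tied-weight auto-encoder with a single bottleneck unit trained by plain gradient descent; this disclosure does not specify an architecture or training procedure, and every name, value, and hyperparameter below is hypothetical.

```python
# Minimal sketch (assumption): a linear, tied-weight auto-encoder with one
# bottleneck unit. Names, data, and hyperparameters are hypothetical.

def encode(w, x):
    """Compress an observable-feature vector x into a single latent value."""
    return sum(wi * xi for wi, xi in zip(w, x))

def decode(w, h):
    """Decompress a latent value h back into observable-feature space."""
    return [h * wi for wi in w]

def reconstruction_loss(w, data):
    """Mean squared difference between each input and its reconstruction."""
    total = 0.0
    for x in data:
        out = decode(w, encode(w, x))
        total += sum((xi - oi) ** 2 for xi, oi in zip(x, out))
    return total / len(data)

def train(data, w, learning_rate=0.05, steps=500):
    """Plain gradient descent on the reconstruction loss."""
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x in data:
            h = encode(w, x)
            err = [oi - xi for oi, xi in zip(decode(w, h), x)]
            err_dot_w = sum(ei * wi for ei, wi in zip(err, w))
            for j in range(len(w)):
                # Gradient of ||h*w - x||^2 with respect to w[j], where h = w.x
                grad[j] += 2.0 * (h * err[j] + err_dot_w * x[j])
        w = [wj - learning_rate * gj / len(data) for wj, gj in zip(w, grad)]
    return w

# Hypothetical observable features: two dimensions that vary together.
observable = [(-0.6, -0.8), (-0.3, -0.4), (0.3, 0.4), (0.6, 0.8)]
initial_weights = [1.0, 0.0]
loss_before = reconstruction_loss(initial_weights, observable)
weights = train(observable, initial_weights)
loss_after = reconstruction_loss(weights, observable)
latent = [encode(weights, x) for x in observable]  # one latent value per input
```

In this toy setting the two observable dimensions are fully correlated, so one latent value per input is enough for the output to substantially match the input once training has converged.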

The input 510 represents the observable features, such as the observable features 363 of FIG. 3 and the observable features 422 of FIG. 4B. The input 510 is split into two portions: the classic features 512 and the augmented features 514. The classic features 512 are similar to the classic features 424 of FIG. 4B. The augmented features 514 are similar to the augmented features 426 of FIG. 4B.

In certain embodiments, the classic features 512 include various data elements 512 a through data element 512 n (512 a-512 n). The data elements 512 a-512 n represent biographical data concerning a particular user or individual. For example, the data element 512 a can represent the age of the user. In another example, the data element 512 b can represent the current location of the user. In another example, the data element 512 c can represent the location where the user was born. In another example, the data element 512 d can represent the gender of the user. Other data elements can represent the educational level of the user, the device the user is currently using, the domain, the country, the language the user speaks, and the like.

The augmented features 514 include various data elements 514 a through data element 514 n (514 a-514 n). The data elements 514 a-514 n represent various aspects of the online footprint of a user. For example, one or more of the data elements 514 a-514 n can represent various aspects of a user profile on social media. In another example, one or more of the data elements 514 a-514 n can represent various messages the user sent or received, such as through SMS or another messaging application. In another example, one or more of the data elements 514 a-514 n can represent posts drafted by the user, such as on a blog, a review, or the like.

The latent features 530 include various learned features that include data elements 532 a through data element 532 n (532 a-532 n). The data elements 532 a-532 n are the compressed representation of the data elements 514 a-514 n. The auto-encoder 428 of FIG. 4B is able to perform unsupervised neural network learning to generate efficient encodings (the data elements 532 a-532 n) from the higher dimensional data, that of the data elements 514 a-514 n. The data elements 532 a-532 n represent a bottleneck encoding, such that the auto-encoder 428 can reconstruct the input 510 as the output 520. The data elements 532 a-532 n are the combination of the classic features 512 and the augmented features 514. For example, the data elements 532 a-532 n include enough information that the auto-encoder can create the output 520 that substantially matches the input 510. That is, a single dimension of the latent features 530 (which includes the data elements 532 a-532 n) can include one or more classic features 512 and augmented features 514. For example, a single data element, such as the data element 532 b, can include a classic feature 512 and an augmented feature 514 that are related to one another.

FIGS. 6A and 6B illustrate a process of creating multiple personalized language models. FIG. 6A illustrates an example process 600 for creating language models in accordance with an embodiment of this disclosure. FIG. 6B illustrates an example cluster 640 a in accordance with an embodiment of this disclosure. The embodiments of the process 600 and the cluster 640 a are for illustration only. Other embodiments could be used without departing from the scope of the present disclosure.

The process 600 can be performed by a server similar to the server 104 of FIG. 1 or the server 480 of FIG. 4C, which includes internal components similar to those of the server 200 of FIG. 2. The process 600 can also be performed by a device similar to any of the client devices 106-114 of FIG. 1 or the electronic device 470 of FIG. 4C, which includes internal components similar to those of the electronic device 300 of FIG. 3. The process 600 can include internal components similar to the ASR system 400 of FIGS. 4A and 4B. The process 600 can be performed by the personalized language model engine 484 of FIG. 4C.

The process 600 includes observable features 610, an auto-encoder 620, latent features 630, clustering 640, information repositories 650 a, 650 b through 650 n (collectively information repositories 650 a-650 n), and language models 660 a, 660 b through 660 n (collectively language models 660 a-660 n). The process 600 illustrates the training and creation of multiple language models, such as the language models 660 a-660 n, based on the observable features 610. The language models 660 a-660 n are not associated with a particular person or user; rather, the language models 660 a-660 n are associated with particular latent features.

The language models 660 a-660 n can be associated with a particular subject matter or multiple subject matters. For example, the language model 660 a can be associated with sports, while the language model 660 b is associated with music. In another example, the language model 660 a can be associated with football, while the language model 660 b is associated with soccer, and the language model 660 c is associated with basketball. That is, if a cluster is larger than a threshold, a language model can be constructed for that particular subject. For instance, a language model can be constructed for sports, or, if the cluster for each type of sport is large enough, specific language models can be constructed for each sport whose cluster exceeds the threshold. Similarly, the topic of music can include multiple genres, politics can include multiple parties, and computer games can include different genres, platforms, and the like. Individual language models can be constructed for each group or subgroup based on the popularity of the subject as identified by the corpus of text of each cluster. It is noted that the points of a cluster share similar properties. For example, a group of people who discuss a similar topic, such as sports, can use various words that mean something to the group but have another meaning if the word is spoken in connection with another group. Language models that are associated with a particular cluster can assign a higher probability to a word having a first meaning than another meaning, based on the cluster and the corpus of words that are associated with the particular latent feature.

In certain embodiments, the process 600 is performed prior to enrolling users into the ASR system. For example, the process 600 creates multiple language models that are specific to a group of users that share a common latent feature. The multiple created language models can then be tailored to users who enroll in the ASR system, in order to create personalized language models for each user. In certain embodiments, the process 600 is performed at predetermined intervals. Repeating the training and creation of language models enables each language model to adapt to the current vernacular that is associated with each latent feature. For example, new language models can be created based on the changes to the verbal utterances and the observable features 610 associated with the users of the ASR system.

The observable features 610 are similar to the observable features 363 of FIG. 3, the observable features 422 of FIG. 4B, and the input 510 of FIG. 5A. In certain embodiments, the observable features 610 represent observable features for a corpus of users. That is, the observable features 610 can be associated with multiple individuals. In certain embodiments, the observable features 610 that are associated with multiple individuals can be used to train the auto-encoder 620. The observable features include both the classic features (such as the classic features 424 of FIG. 4B) and the augmented features (such as the augmented features 426 of FIG. 4B). Each of the elements within the observable features 610 can be represented as a dimension of a multi-dimensional vector.

The auto-encoder 620 is similar to the auto-encoder 428 of FIG. 4B and the auto-encoder 500 of FIG. 5A. The auto-encoder 620 identifies the latent features 630 from the observable features 610. The latent features 630 can be represented as a multi-dimensional vector. It is noted that the multi-dimensional latent feature vector, as derived by the auto-encoder 620, can include a large number of dimensions. The multi-dimensional latent feature vector includes fewer dimensions than the multi-dimensional observable feature vector. For example, the multi-dimensional latent feature vector can include over 100 dimensions, with each dimension representing a latent feature that is associated with one or more users.

Clustering 640 identifies groups of text associated with each latent feature. Clustering 640 can identify clustering of text such as illustrated in the example cluster 640 a of FIG. 6B. The cluster 640 a depicts three clusters: cluster 642, cluster 644, and cluster 646. The clustering 640 plots the latent features 630 to identify a cluster. Each cluster is centered on a centroid. The centroid is the position of the highest weight of the latent features. Each point on the clustering 640 can be a verbal utterance that is associated with a latent feature. For example, if each dimension of the clustering 640 corresponds to a latent feature, each point represents a verbal utterance. A cluster can be identified when a group of verbal utterances creates a centroid. The clustering 640 can be represented via a two-dimensional graph or a multi-dimensional graph. For example, the cluster 640 a can be presented in a multi-dimensional graph, such that each axis of the cluster 640 a is a dimension of the latent features 630.

In certain embodiments, the number of clusters can be identified based on the data. For example, the latent features can be grouped into certain identifiable groupings, and then each grouping is identified as a cluster. In certain embodiments, the number of clusters can be a predetermined number. For example, clustering 640 plots the latent features 630 and identifies a predetermined number of clusters, based on the size, density, or the like. If the predetermined number of clusters is three, clustering 640 identifies three centroids with the highest concentration, such as the centroids of the cluster 642, the cluster 644, and the cluster 646.
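As an illustration of identifying a predetermined number of clusters, the following minimal sketch groups hypothetical two-dimensional latent points into three clusters. It uses k-means with a deterministic farthest-point initialization, which is a stand-in technique chosen here for illustration; this disclosure does not name a specific clustering algorithm, and the points and function names are assumptions.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def init_centroids(points, k):
    """Farthest-point initialization: deterministic, spreads the seeds apart."""
    centroids = [points[0]]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(dist(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids

def k_means(points, k, iterations=20):
    """Assumed stand-in for the clustering 640: returns centroids and labels."""
    centroids = init_centroids(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: dist(p, centroids[i]))
            groups[idx].append(p)
        # Keep the previous centroid if a group ends up empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    labels = [min(range(k), key=lambda i: dist(p, centroids[i])) for p in points]
    return centroids, labels

# Hypothetical latent feature points forming three visible groupings.
latent_points = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
    (0.1, 9.0), (0.0, 9.2), (0.2, 8.9),
]
centroids, labels = k_means(latent_points, k=3)
```

A data-driven variant would choose `k` from the data instead of fixing it in advance, for example by trying several values and comparing how tightly the points group around their centroids.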

After clustering 640 of the latent features 630, the information repositories 650 a-650 n are generated. The information repositories 650 a-650 n can be similar to the information repository 486 of FIG. 4C. The information repositories 650 a-650 n represent verbal utterances that are associated with each cluster. Using the corpus of text in each of the respective information repositories 650 a-650 n, the language models 660 a-660 n are generated. The language models 660 a-660 n are created around clusters that are defined in spaces using the latent features.
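The step from clusters to cluster-specific language models can be sketched as follows. This is a minimal illustration under stated assumptions: each cluster point is keyed to a list of utterances, the per-cluster repository is the concatenation of those lists, and the language model is a simple unigram model. This disclosure does not limit the language models to unigrams, and the data and function names are hypothetical.

```python
from collections import Counter

def build_cluster_corpus(utterances_by_point, cluster_members):
    """Gather the utterances of every point that belongs to one cluster."""
    corpus = []
    for point in cluster_members:
        corpus.extend(utterances_by_point.get(point, []))
    return corpus

def train_unigram_model(corpus):
    """Estimate word probabilities from relative frequencies (assumed unigram LM)."""
    counts = Counter(word for utterance in corpus for word in utterance.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical utterances attached to three points in latent space.
utterances_by_point = {
    "p1": ["play the game", "game score"],
    "p2": ["the final score"],
    "p3": ["play the song"],
}
sports_cluster = ["p1", "p2"]  # assumed cluster membership

sports_corpus = build_cluster_corpus(utterances_by_point, sports_cluster)
sports_model = train_unigram_model(sports_corpus)
```

Because "p3" is not a member of the assumed sports cluster, the word "song" receives no probability mass in `sports_model`, which is the sense in which each model reflects only the text of its own cluster.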

FIG. 7 illustrates an example process 700 for creating a personalized language model for a new user in accordance with an embodiment of this disclosure. The embodiment of the process 700 is for illustration only. Other embodiments could be used without departing from the scope of the present disclosure.

The process 700 can be performed by a server similar to the server 104 of FIG. 1 or the server 480 of FIG. 4C, which includes internal components similar to those of the server 200 of FIG. 2. The process 700 can also be performed by a device similar to any of the client devices 106-114 of FIG. 1 or the electronic device 470 of FIG. 4C, which includes internal components similar to those of the electronic device 300 of FIG. 3. The process 700 can include internal components similar to the ASR system 400 of FIGS. 4A and 4B. The process 700 can be performed by the personalized language model engine 484 of FIG. 4C.

The process 700 includes latent features of a new user 710, the cluster 640 a (of FIG. 6B), a similarity measure module 720, a model adaptation engine 730 which uses the language models 660 a-660 n of FIG. 6B, and a personalized language model 740. The personalized language model 740 is defined based on the latent features of the new user 710.

When a new user joins the ASR system, the observable features of the new user, such as the observable features 422 of FIG. 4B, are gathered. In certain embodiments, the personalized language model engine 484 of FIG. 4C instructs the electronic device 470 to gather the observable features. In certain embodiments, the personalized language model engine 484 gathers the observable features of the user. Some of the observable features can be identified when the user creates a profile with the ASR system. Some of the observable features can be identified based on the user profile, SMS text messages of the user, social media posts of the user, reviews written by the user, blogs written by the user, the online footprint of the user, and the like. An auto-encoder, similar to the auto-encoder 428 of FIG. 4B, identifies the latent features of the new user 710. In certain embodiments, the electronic device 470 can transmit the observable features to an auto-encoder that is located remotely from the electronic device 470. In certain embodiments, the electronic device 470 includes an auto-encoder that can identify the latent features of the new user 710.

The similarity measure module 720 receives the latent features of the new user 710 and identifies levels of similarity between the latent features of the new user 710 and the clusters 642, 644, and 646 generated by the clustering 640 of FIG. 6B. It is noted that more or fewer clusters can be included in the cluster 640 a. In certain embodiments, the similarities are identified by a cosine similarity metric. In certain embodiments, the similarity measure module 720 identifies how similar the user is to one or more clusters. In certain embodiments, the similarity measure module 720 includes an affinity metric. The affinity metric defines the similarity of the clusters of a new user to the various clusters already identified, such as those of the cluster 640 a.
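For the embodiments that use a cosine similarity metric, the comparison between a new user and the existing clusters can be sketched as below. The sketch assumes that each cluster is summarized by its centroid in latent space; the vectors, dictionary keys, and function name are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical centroids for the clusters 642, 644, and 646.
centroids = {
    "cluster_642": (1.0, 0.0, 0.0),
    "cluster_644": (0.0, 1.0, 0.0),
    "cluster_646": (1.0, 1.0, 0.0),
}
new_user_latent = (2.0, 0.1, 0.0)  # hypothetical latent features of a new user

similarities = {
    name: cosine_similarity(new_user_latent, centroid)
    for name, centroid in centroids.items()
}
closest_cluster = max(similarities, key=similarities.get)
```

Cosine similarity compares direction only, so a user whose latent vector is a scaled copy of a centroid is treated as maximally similar to that cluster regardless of magnitude.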

In certain embodiments, the similarity measure module 720 generates a function 722 and forwards the function to the model adaptation engine 730. The function 722 represents a similarity measure of the user to the various clusters 642, 644, and 646. For example, the function 722 can be expressed as S(u, t_(i)). In the expression S(u, t_(i)), the clusters 642, 644, and 646 are identified by ‘t₁’, ‘t₂’, and ‘t₃’, respectively, and the latent features of the new user 710 are identified by ‘u’.

The model adaptation engine 730 combines certain language models to generate a language model personalized for the user, ‘u,’ based on the function 722. The model adaptation engine 730 generates the personalized language model 740 based on probabilities and linear interpolation. For example, the model adaptation engine 730 identifies certain clusters that are similar to the latent features of the user. The identified clusters can be expressed in the function, such as S(u, t_(i)), where t_(i) represents the clusters most similar to the user. Each cluster is used to build particular language models 660 a-660 n. The model adaptation engine 730 then weights each language model (language models 660 a-660 n) based on the function 722 to create a personalized language model 740. In certain embodiments, if one or more of the language models 660 a-660 n are below a threshold, those language models are excluded and not used by the model adaptation engine 730 to create the personalized language model 740.
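The weighting and exclusion described above can be sketched as a linear interpolation of word probabilities. This is a minimal sketch under assumptions: the threshold is applied to each cluster's similarity score, the remaining scores are normalized into interpolation weights, and each language model is a unigram table. The model contents, scores, and function name are hypothetical.

```python
def personalize(models, similarity, threshold=0.0):
    """Linearly interpolate cluster language models, weighted by similarity.

    Models whose similarity falls below the (assumed) threshold are excluded.
    """
    kept = {name: s for name, s in similarity.items() if s >= threshold}
    if not kept:
        raise ValueError("no language model meets the threshold")
    total = sum(kept.values())
    weights = {name: s / total for name, s in kept.items()}
    vocabulary = set().union(*(models[name] for name in kept))
    return {
        word: sum(weights[name] * models[name].get(word, 0.0) for name in kept)
        for word in vocabulary
    }

# Hypothetical cluster language models (word -> probability).
models = {
    "sports": {"goal": 0.5, "score": 0.5},
    "music": {"score": 0.5, "song": 0.5},
    "politics": {"vote": 1.0},
}
# Hypothetical output of the function 722 for one user.
similarity = {"sports": 0.6, "music": 0.2, "politics": 0.05}

personalized = personalize(models, similarity, threshold=0.1)
```

Here the politics model is dropped, the sports and music models receive weights of 0.75 and 0.25, and a word such as "score" that appears in both retained models draws probability from each in proportion to those weights.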

Because each cluster (such as cluster 642) represents a group of persons who have similar interests (latent features), a language model based on a particular cluster will have a probability assigned to each word. As a result, two language models based on different clusters can have different probabilities associated with similar words. Based on how similar a user is to each cluster, the model adaptation engine 730 combines the probabilities associated with the various words in the respective language models and assigns a unique weight for each word, thereby creating the personalized language model 740. For example, the process 700 can be expressed by Equation 3 below.

LM_(u) = F(LM₁, LM₂, . . . , LM_(n), S(t₁, u), S(t₂, u), . . . , S(t_(n), u))   (3)

Equation 3 describes the process 700 of enrolling a new user and creating a personalized language model 740 for the new user. An auto-encoder, similar to the auto-encoder 500 of FIG. 5A, obtains latent features, expressed by variable ‘h.’ The latent vectors can be clustered into similar groups ‘C_(i).’ The centroid of each cluster ‘C_(i)’ is denoted by ‘t_(i).’ For each such cluster, a language model LM_(i) is created based on all the text corpus corresponding to the points in the cluster ‘C_(i)’ that are projected back into the original observable features (such as the observable features 610) and expressed by the variable ‘V.’ Each of the variables LM represents a particular language model that is constructed from a cluster. In certain embodiments, the function 722 (S(t_(i), u)) is created by Equation 4 below. Similarly, the function ‘F’ of Equation 3 is expressed by Equation 5 below. Additionally, Equation 6 below depicts the construction of a database that is used to create a corresponding language model.

$\begin{matrix} S(t_1, u) = \frac{1}{d(t_1, u)} & (4) \\ P_{LM_u}(w) = \sum_{i} P(u \in C_i) \cdot P_{LM_i}(w) & (5) \\ DB_{C_i} = \bigcup_{p_i \in C_i} DB(p_i) & (6) \end{matrix}$

Equation 4 denotes that the function 722 is obtained based on the inverse of d(t₁,u), where the function d(t₁,u) is the Euclidean distance from the vector ‘u’ to the closest cluster t₁. Equation 5 represents the function ‘F’ of Equation 3, which is used to create the personalized language model 740. In Equation 5, P(u ∈ C_(i)) ∝ S(t_(i), u), and P_(LM_(i))(w) denotes the probability of a given word ‘w’ based on the language model ‘LM_(i).’ For example, a general purpose language model LM is based on the probabilities P_(LM)(w), where ‘w’ is the word sequence hypothesis [w₀, . . . , w_(k)], with ‘w_(i)’ being the corresponding word in the sequence. For example, P_(LM_(i)) is the probability that is associated with each word of a particular language model, such as the language model ‘i.’ Similarly, LM_(u) is the personalized language model that is associated with a particular user ‘u,’ such as the personalized language model 740. Equation 6 above denotes the creation of a database ‘DB’ for a particular cluster ‘C_(i).’
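A small worked example may help connect the inverse-distance similarity to the cluster membership probabilities used in the interpolation. The sketch below assumes that the similarity is computed for every centroid, that the membership probabilities are obtained by normalizing the similarities so they sum to one, and that a small epsilon guards against a zero distance; these choices, along with all names and numbers, are illustrative assumptions.

```python
import math

def inverse_distance_similarity(centroid, user, eps=1e-9):
    """Similarity as the inverse of the Euclidean distance (eps is an assumption)."""
    d = math.sqrt(sum((c - x) ** 2 for c, x in zip(centroid, user)))
    return 1.0 / max(d, eps)

def membership_probabilities(centroids, user):
    """Normalize the similarities so that the membership probabilities sum to one."""
    scores = [inverse_distance_similarity(t, user) for t in centroids]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical centroids t_1, t_2 and user latent vector u.
centroids = [(0.0, 0.0), (4.0, 0.0)]
user = (1.0, 0.0)

# Distances are 1 and 3, so the similarities are 1 and 1/3.
membership = membership_probabilities(centroids, user)

# Hypothetical probabilities of one word under the two cluster language models.
word_probability_by_model = [0.2, 0.04]
personalized_word_probability = sum(
    p_cluster * p_word
    for p_cluster, p_word in zip(membership, word_probability_by_model)
)
```

With these numbers the membership probabilities are 0.75 and 0.25, so the personalized probability of the word is 0.75 × 0.2 + 0.25 × 0.04 = 0.16.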

In certain embodiments, once a personalized language model, such as the personalized language model 740, is constructed for a particular user, a dynamic run-time contextual rescoring can be performed in order to update the personalized language model based on updates to the language models (such as the language models 660 a-660 n). Dynamic run-time contextual rescoring is expressed by Equation 7, below.

P_(LM)(W) ∝ P_(LM_(u))(W)·P_(LM)(W|D)·P_(LM)(W|DC)·P_(LM)(W|DM)   (7)

The expressions P_(LM)(W|DM), P_(LM)(W|DC), and P_(LM)(W|D) denote separate probabilities of a word sequence ‘W’ given the respective elements. For example, ‘DM’ corresponds to a dialogue management context, similar to the dialogue manager 418 of FIG. 4B. In another example, ‘DC’ corresponds to a domain classifier, similar to the domain classifier 414 of FIG. 4B. In another example, ‘D’ corresponds to a device identification, such as the device context 420 of FIG. 4B. For example, Equation 7 denotes that once a personalized language model 740 is constructed for a particular user, based on the user's latent features, the language models (such as the language models 660 a-660 n) that are used to construct the personalized language model 740 can be updated. If the language models (such as the language models 660 a-660 n) that are used to construct the personalized language model 740 are updated, a notification can be generated, indicating that the personalized language model 740 is to be updated accordingly. In certain embodiments, the personalized language model 740 is not updated even when the language models (such as the language models 660 a-660 n) that are used to construct the personalized language model 740 are updated. The language models 660 a-660 n can be updated based on contextual information from dialogue management, domain, device, and the like.
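The product form of the contextual rescoring can be sketched over a small list of competing hypotheses. The sketch assumes that each hypothesis already has four strictly positive probabilities (from the personalized language model and from the device, domain classifier, and dialogue management contexts), multiplies them in log space, and normalizes across hypotheses; the hypotheses, values, and function name are hypothetical.

```python
import math

def rescore(hypotheses):
    """Combine per-hypothesis probabilities as a product and normalize.

    hypotheses maps a word sequence to a tuple of strictly positive
    probabilities: (personalized LM, device, domain classifier, dialogue).
    """
    log_scores = {
        text: sum(math.log(p) for p in probabilities)
        for text, probabilities in hypotheses.items()
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    unnormalized = {text: math.exp(s - top) for text, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {text: value / total for text, value in unnormalized.items()}

# Hypothetical probabilities for two candidate word sequences.
hypotheses = {
    "play jazz": (0.02, 0.5, 0.6, 0.5),
    "play chess": (0.03, 0.5, 0.1, 0.2),
}
rescored = rescore(hypotheses)
best_hypothesis = max(rescored, key=rescored.get)
```

In this example the personalized language model alone slightly prefers "play chess," but the domain and dialogue contexts shift the combined score toward "play jazz."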

FIG. 8 illustrates an example method for determining an operation to perform based on contextual information, in accordance with an embodiment of this disclosure. FIG. 8 does not limit the scope of this disclosure to any particular embodiments. While process 800 depicts a series of sequential steps, unless explicitly stated, no inference should be drawn from that sequence regarding specific order of performance. For example, performance of steps as depicted in process 800 can occur serially, concurrently, or in an overlapping manner. The performance of the steps depicted in process 800 can also occur with or without intervening or intermediate steps. The method for speech recognition can be performed by any of the client devices 104-114 of FIG. 1, the server 200 of FIG. 2, the electronic device 300 of FIG. 3, the ASR system 400 of FIGS. 4A and 4B, the electronic device 470 of FIG. 4C, or the server 480 of FIG. 4C. For ease of explanation, the process 800 for speech recognition is described as being performed by the server 480 of FIG. 4C. However, the process 800 can be used with any other suitable system.

In block 810, the server 480 identifies a set of observable features. The set of observable features can include at least one classic feature and at least one augmented feature. The classic features can include biographical information about the individual (such as age, gender, location, hometown, and the like). The augmented features can include features that are acquired about the user, such as SMS text messages, social media posts, written reviews, written blogs, logs, environment, context, and the like. The augmented features can be derived from the online footprint of the user.

In block 820, the server 480 generates a set of latent features from the set of observable features. To generate the set of latent features, the processor generates a multidimensional vector based on the set of observable features. Each dimension of the multidimensional vector corresponds to one feature of the set of observable features. The processor then reduces a quantity of dimensions of the multidimensional vector to derive the set of latent features. In certain embodiments, the quantity of dimensions of the multidimensional vector is reduced using an auto-encoding procedure. Auto-encoding can be performed by an auto-encoder neural network. The auto-encoder can be located on the server 480, or on another device such as an external auto-encoder or the electronic device that receives the verbal utterance and is associated with the user, such as one of the client devices 106-114.
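The dimension reduction of block 820 can be sketched with a very small auto-encoder. The following Python example is an illustration under stated assumptions rather than the disclosed network: it uses a purely linear encoder and decoder without bias terms, trains them by full-batch gradient descent on the squared reconstruction error, and operates on hypothetical, already-normalized observable feature vectors. All names, the learning rate, and the sample data are assumptions.

```python
import random

def encode(E, x):
    """Project an observable feature vector x onto the latent dimensions."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in E]

def decode(D, z):
    """Reconstruct an observable feature vector from latent features z."""
    return [sum(w * zk for w, zk in zip(row, z)) for row in D]

def reconstruction_error(E, D, data):
    """Mean squared reconstruction error over the data set."""
    return sum(
        sum((xh - xi) ** 2 for xh, xi in zip(decode(D, encode(E, x)), x))
        for x in data
    ) / len(data)

def train_autoencoder(data, latent_dim, epochs=500, lr=0.05, seed=0):
    """Fit a linear encoder E and decoder D by full-batch gradient descent."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    E = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(latent_dim)]
    D = [[rng.uniform(-0.1, 0.1) for _ in range(latent_dim)] for _ in range(d)]
    for _ in range(epochs):
        grad_E = [[0.0] * d for _ in range(latent_dim)]
        grad_D = [[0.0] * latent_dim for _ in range(d)]
        for x in data:
            z = encode(E, x)
            r = [xh - xi for xh, xi in zip(decode(D, z), x)]  # reconstruction error
            for i in range(d):
                for k in range(latent_dim):
                    grad_D[i][k] += 2 * r[i] * z[k] / n
            for k in range(latent_dim):
                back = sum(D[i][k] * r[i] for i in range(d))  # error propagated through D
                for j in range(d):
                    grad_E[k][j] += 2 * back * x[j] / n
        for k in range(latent_dim):
            for j in range(d):
                E[k][j] -= lr * grad_E[k][j]
        for i in range(d):
            for k in range(latent_dim):
                D[i][k] -= lr * grad_D[i][k]
    return E, D

# Hypothetical observable feature vectors (three normalized features per user).
observable = [[0.2, 0.1, 0.9], [0.25, 0.15, 0.8], [0.8, 0.9, 0.1], [0.75, 0.85, 0.2]]
E0, D0 = train_autoencoder(observable, latent_dim=1, epochs=0)  # untrained baseline
E, D = train_autoencoder(observable, latent_dim=1)
latent = [encode(E, x) for x in observable]  # one latent feature per user
```

A practical auto-encoder neural network would typically add nonlinear activations and bias terms; the linear form is used here only to keep the sketch short.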

In block 830, the server 480 sorts the latent features into one or more clusters. Each of the one or more clusters represents verbal utterances of users that share a portion of the latent features. Each cluster includes verbal utterances associated with the particular latent features that are mapped to that cluster.
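The sorting of block 830 can be illustrated with a simple clustering routine. This passage does not name a clustering algorithm, so the following Python sketch assumes k-means with a naive initialization (the first k points), purely for illustration. The function name and the sample latent vectors are hypothetical.

```python
import math

def kmeans(points, k, iterations=20):
    """Assign each latent feature vector to one of k clusters (assumed k-means)."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical 2-D latent feature vectors for four users.
latent_vectors = [(0.1, 0.2), (0.9, 0.8), (0.15, 0.1), (0.85, 0.9)]
labels, centroids = kmeans(latent_vectors, k=2)
```

Users whose latent vectors fall in the same cluster would then contribute their verbal utterances to that cluster.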

In block 840, the server 480 generates a language model that corresponds to a cluster of the one or more clusters. The language model represents a probability ranking of the verbal utterances that are associated with the users of the cluster.

In certain embodiments, the language model includes at least a first language model and a second language model. Each of the at least first and second language models corresponds to one of the one or more clusters, respectively. The server 480 can then identify a centroid of each of the one or more clusters. Based on the identified centroid, the server 480 constructs a first database based on the verbal utterances of a first set of users that are associated with a first of the one or more clusters. Similarly, based on a second identified centroid, the server 480 constructs a second database based on the verbal utterances of a second set of users that are associated with a second of the one or more clusters. Thereafter, the server 480 can generate the first language model based on the first database and the second language model based on the second database. The server 480 can also generate the language model based on weighting the first and second language models.
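The construction of per-cluster databases and language models described above can be sketched as follows. This Python example assumes that each user has already been assigned a cluster label and, as a simplification, estimates a unigram model by relative word frequency. The disclosure does not limit the language model to unigrams, and the function names and sample utterances are hypothetical.

```python
from collections import Counter, defaultdict

def build_databases(labels, utterances_by_user):
    """Pool the utterances of all users that share a cluster label."""
    databases = defaultdict(list)
    for label, utterances in zip(labels, utterances_by_user):
        databases[label].extend(utterances)
    return dict(databases)

def train_unigram_lm(utterances):
    """Estimate word probabilities by relative frequency (simplified LM)."""
    counts = Counter(word for u in utterances for word in u.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical cluster labels and utterances for three users.
user_labels = [0, 1, 0]
utterances_by_user = [["play jazz"], ["call mom"], ["play rock"]]

databases = build_databases(user_labels, utterances_by_user)
cluster_lms = {cid: train_unigram_lm(db) for cid, db in databases.items()}
```

Here the first database pools the utterances of the two users in cluster 0, so the word "play" receives half of the probability mass in that cluster's model.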

In certain embodiments, the language model includes multiple language models, such as at least a first language model and a second language model. Each language model corresponds to one of the one or more clusters, respectively. The server 480 can acquire one or more observable features associated with a new user. After the new observable features are acquired, the processor identifies one or more latent features for the new user based on the one or more observable features that are associated with the new user. The server 480 can identify levels of similarity between the one or more latent features of the new user and the set of latent features that are included in the one or more clusters. After identifying the levels of similarity, the server 480 generates a personalized weighted language model for the new user. The personalized weighted language model is based on the levels of similarity between the one or more latent features of the new user and the one or more clusters.

To generate the personalized weighted language model for the new user, the server 480 can identify a cluster that is below a threshold of similarity between the one or more latent features of the new user and the set of latent features associated with a subset of the one or more clusters. In response to identifying the cluster being below the threshold of similarity, the server 480 excludes a language model that is associated with the identified cluster in generating the personalized weighted language model for the new user.
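The exclusion step described above can be sketched as a filter over per-cluster similarity levels. In the following Python example, the threshold value, the cluster identifiers, and the renormalization of the remaining weights are assumptions made for illustration; the disclosure does not specify how the remaining weights are rescaled.

```python
def thresholded_weights(similarities, threshold):
    """Drop clusters below the similarity threshold and renormalize the rest.

    `similarities` maps a cluster identifier to a similarity level for the
    new user. Renormalizing to sum to 1 is an assumption of this sketch.
    """
    kept = {cid: s for cid, s in similarities.items() if s >= threshold}
    total = sum(kept.values())
    if total == 0:
        return {}  # no cluster is similar enough to contribute
    return {cid: s / total for cid, s in kept.items()}

# Hypothetical similarity levels between a new user and three clusters.
similarities = {"c0": 0.6, "c1": 0.3, "c2": 0.05}
weights = thresholded_weights(similarities, threshold=0.1)
```

With these values, the language model of cluster "c2" is excluded, and the remaining two models would be combined with weights of two thirds and one third.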

Although the figures illustrate different examples of user equipment, various changes may be made to the figures. For example, the user equipment can include any number of each component in any suitable arrangement. In general, the figures do not limit the scope of this disclosure to any particular configuration(s). Moreover, while figures illustrate operational environments in which various user equipment features disclosed in this patent document can be used, these features can be used in any other suitable system.

None of the description in this application should be read as implying that any particular element, step, or function is an essential element that must be included in the claim scope. The scope of patented subject matter is defined only by the claims. Moreover, none of the claims is intended to invoke 35 U.S.C. § 112(f) unless the exact words “means for” are followed by a participle. Use of any other term, including without limitation “mechanism,” “module,” “device,” “unit,” “component,” “element,” “member,” “apparatus,” “machine,” “system,” “processor,” or “controller,” within a claim is understood by the applicants to refer to structures known to those skilled in the relevant art and is not intended to invoke 35 U.S.C. § 112(f).

Although the present disclosure has been described with an exemplary embodiment, various changes and modifications may be suggested to one skilled in the art. It is intended that the present disclosure encompass such changes and modifications as fall within the scope of the appended claims.

What is claimed is:
 1. A method comprising: identifying a set of observable features associated with one or more users; generating latent features from the set of observable features; sorting the latent features into one or more clusters, each of the one or more clusters representing verbal utterances of a group of users that share a portion of the latent features; and generating a language model that corresponds to a specific cluster of the one or more clusters, the language model representing a probability ranking of the verbal utterances that are associated with the group of users of the specific cluster.
 2. The method of claim 1, wherein generating the latent features comprises: generating a multidimensional vector based on the set of observable features, each dimension of the multidimensional vector corresponding to one feature of the set of observable features; and reducing a quantity of dimensions of the multidimensional vector to derive the latent features.
 3. The method of claim 2, wherein the quantity of dimensions of the multidimensional vector is reduced using an auto-encoding procedure.
 4. The method of claim 1, wherein: the language model includes a first language model and a second language model, each of the at least first and second language models corresponding to one of the one or more clusters, respectively, and generating the language model comprises: identifying a centroid of each of the one or more clusters; constructing a first database based on the verbal utterances of a first group of users that are associated with one of the one or more clusters; constructing a second database based on the verbal utterances of a second group of users that are associated with another of the one or more clusters; and after constructing the first database and the second database, generating the first language model based on the first database and the second language model based on the second database.
 5. The method of claim 1, wherein the set of observable features comprises at least one classic feature and at least one augmented feature.
 6. The method of claim 1, wherein: the language model includes one or more language models, each of the one or more language models corresponding to one of the one or more clusters that represent the verbal utterances, and wherein the method further comprises: acquiring one or more observable features associated with a new user; identifying one or more latent features of the new user based on the one or more observable features that are associated with the new user; identifying levels of similarity between the one or more latent features of the new user and the sorted latent features; and generating a personalized weighted language model for the new user, the personalized weighted language model based on the levels of similarity between the one or more latent features of the new user and the one or more clusters that represent verbal utterances of groups of users that share a portion of the latent features.
 7. The method of claim 6, wherein: the language model includes multiple language models, and the method further comprises: identifying one cluster that is below a threshold of similarity between the one or more latent features of the new user and the latent features associated with a subset of the one or more clusters; and excluding one of the multiple language models that is associated with the one cluster that is below a threshold of similarity when generating the personalized weighted language model for the new user.
 8. An electronic device comprising: a memory; and a processor operably connected to the memory, the processor configured to: identify a set of observable features associated with one or more users; generate latent features from the set of observable features; sort the latent features into one or more clusters, each of the one or more clusters represents verbal utterances of a group of users that share a portion of the latent features; and generate a language model that corresponds to a specific cluster of the one or more clusters, the language model representing a probability ranking of the verbal utterances that are associated with the group of users of the specific cluster.
 9. The electronic device of claim 8, wherein to generate the latent features, the processor is further configured to: generate a multidimensional vector based on the set of observable features, each dimension of the multidimensional vector corresponding to one feature of the set of observable features; and reduce a quantity of dimensions of the multidimensional vector to derive the latent features.
 10. The electronic device of claim 9, wherein the quantity of dimensions of the multidimensional vector is reduced using an auto-encoding procedure.
 11. The electronic device of claim 8, wherein: the language model includes a first language model and a second language model, each of the at least first and second language models corresponding to one of the one or more clusters, respectively, and to generate the language model, the processor is further configured to: identify a centroid of each of the one or more clusters; construct a first database based on the verbal utterances of a first group of users that are associated with one of the one or more clusters; construct a second database based on the verbal utterances of a second group of users that are associated with another of the one or more clusters; and after constructing the first database and the second database, generate the first language model based on the first database and the second language model based on the second database.
 12. The electronic device of claim 8, wherein the set of observable features comprises at least one classic feature and at least one augmented feature.
 13. The electronic device of claim 8, wherein: the language model includes one or more language models, each of the one or more language models corresponds to one of the one or more clusters that represent the verbal utterances, and the processor is further configured to: acquire one or more observable features associated with a new user; identify one or more latent features of the new user based on the one or more observable features that are associated with the new user; identify levels of similarity between the one or more latent features of the new user and the sorted latent features; and generate a personalized weighted language model for the new user, the personalized weighted language model based on the levels of similarity between the one or more latent features of the new user and the one or more clusters that represent verbal utterances of groups of users that share a portion of the latent features.
 14. The electronic device of claim 13, wherein: the language model includes multiple language models, and the processor is further configured to: identify one cluster that is below a threshold of similarity between the one or more latent features of the new user and the latent features associated with a subset of the one or more clusters; and exclude one of the multiple language models that is associated with the one cluster that is below a threshold of similarity when the personalized weighted language model for the new user is generated.
 15. A non-transitory computer readable medium embodying a computer program, the computer program comprising computer readable program code that, when executed by a processor of an electronic device, causes the processor to: identify a set of observable features associated with one or more users; generate latent features from the set of observable features; sort the latent features into one or more clusters, each of the one or more clusters represents verbal utterances of a group of users that share a portion of the latent features; and generate a language model that corresponds to a specific cluster of the one or more clusters, the language model represents a probability ranking of the verbal utterances that are associated with the group of users of the specific cluster.
 16. The non-transitory computer readable medium of claim 15, wherein to generate the latent features, the program code, when executed by the processor, further causes the processor to: generate a multidimensional vector based on the set of observable features, each dimension of the multidimensional vector corresponding to one feature of the set of observable features; and reduce a quantity of dimensions of the multidimensional vector to derive the latent features.
 17. The non-transitory computer readable medium of claim 15, wherein: the language model includes a first language model and a second language model, each of the at least first and second language models corresponding to one of the one or more clusters, respectively, and to generate the language model, the program code, when executed by the processor, further causes the processor to: identify a centroid of each of the one or more clusters; construct a first database based on the verbal utterances of a first group of users that are associated with one of the one or more clusters; construct a second database based on the verbal utterances of a second group of users that are associated with another of the one or more clusters; and after constructing the first database and the second database, generate the first language model based on the first database and the second language model based on the second database.
 18. The non-transitory computer readable medium of claim 15, wherein the set of observable features comprises at least one classic feature and at least one augmented feature.
 19. The non-transitory computer readable medium of claim 15, wherein: the language model includes one or more language models, each of the one or more language models corresponding to one of the one or more clusters that represent the verbal utterances, and the program code, when executed by the processor, further causes the processor to: acquire one or more observable features associated with a new user; identify one or more latent features of the new user based on the one or more observable features that are associated with the new user; identify levels of similarity between the one or more latent features of the new user and the sorted latent features; and generate a personalized weighted language model for the new user, the personalized weighted language model based on the levels of similarity between the one or more latent features of the new user and the one or more clusters that represent verbal utterances of groups of users that share a portion of the latent features.
 20. The non-transitory computer readable medium of claim 19, wherein: the language model includes multiple language models, and the program code, when executed by the processor, further causes the processor to: identify one cluster that is below a threshold of similarity between the one or more latent features of the new user and the latent features associated with a subset of the one or more clusters; and exclude one of the multiple language models that is associated with the one cluster that is below a threshold of similarity when the personalized weighted language model for the new user is generated.